Filters as a Language Support for Design Patterns in
Object-Oriented Scripting Languages 


Gustaf Neumann and Uwe Zdun 
Information Systems and Software Techniques 
University of Essen, Germany 
{gustaf.neumann,uwe.zdun}@uni-essen.de


Abstract 


Scripting languages are designed for glueing soft- 
ware components together. Such languages pro- 
vide features like dynamic extensibility and dynamic 
typing with automatic conversion that make them 
well suited for rapid application development. Al- 
though these features entail runtime penalties, mod- 
ern CPUs are fast enough to execute even large ap- 
plications in scripting languages efficiently. 


Large applications typically entail complex pro- 
gram structures. Object-orientation offers the 
means to solve some of the problems caused by this 
complexity, but focuses only on entities up to the 
size of a single class. The object-oriented design 
community proposes design patterns as a solution 
for complex interactions that are poorly supported 
by current object-oriented programming languages. 
In order to use patterns in an application, their im- 
plementation has to be scattered over several classes. 
This fact makes patterns hard to locate in the ac- 
tual code and complicates their maintenance in an 
application. 


This paper presents a general approach to com- 
bine the ideas of scripting and object-orientation in 
a way that preserves the benefits of both of them. 
It describes the object-oriented scripting language 
XOTcl (Extended OTcl), which is equipped with
several language functionalities that help in the im- 
plementation of design patterns. We introduce the 
filter approach which provides a novel, intuitive, and 
powerful language support for the instantiation of 
large program structures like design patterns. 


1 Introduction 


1.1 Scripting Languages 


In applications where the emphasis lies on the
flexible reuse of components, scripting languages
like Tcl (Tool Command Language [25]) are very
useful for a fast and high-quality development of 
software. The application development in scripting 
languages differs fundamentally from the develop- 
ment in systems programming languages [26] (like 
C, C++ or Java), where the whole system is de- 
veloped in a single language. A scripting language 
follows a two-level approach, distinguishing between 
components (reusable software modules) and glue- 
ing code, which is used to combine the components 
according to the application needs. This two level 
approach leads to a rapid application development 
[26]. 


Scripting languages are typically interpreted and 
use a dynamic type system with automatic conver- 
sion. The application developer uses a single data 
type (strings) for the representation of all data. 
Therefore, the interfaces of all components fit to- 
gether automatically and the components can be 
reused in unpredicted situations without change. 
The disadvantages of scripting languages are a loss in 
efficiency (e.g. for dynamic conversions and method 
lookup) and the lack of reliability properties of a 
static type system [18]. But these disadvantages can 
be compensated to a certain degree: 


• For several application tasks, the loss of efficiency
  is not necessarily relevant, because the
  time-critical code can be placed into components
  written in efficient systems programming
  languages. Only the code to control these components
  is kept in the highly flexible scripting
  language.

• Since the components are typically written in
  a language with a static type system, the reliability
  argument applies only to the glue code
  used to combine the components. To address
  these remaining problems we have integrated
  an assertion concept based on pre- and postconditions
  and invariants (see [24]).


Since Tcl is designed for glueing components together,
it is equipped with appropriate functionalities,
such as dynamic typing, dynamic extensibility
and read/write introspection. Many object-oriented
Tcl extensions do not support these abilities well in
their language constructs. They integrate foreign
concepts and syntactic elements (mostly adopted
from C++) into Tcl (see e.g. [15, 9]). Even less
appropriate is the encouraged programming style in 
structured blocks and the corresponding rigid class 
concept, which sees classes as write-once, unchange- 
able templates for their instances. This concept is 
not compatible with the highly dynamic properties 
of the underlying scripting language. 


1.2 Scripting and Object Orientation 


The three most important benefits of object- 
orientation are encapsulation of data and operations, 
code reuse through inheritance, and polymorphism. 
These should help to reduce development time, to 
increase software reuse, to ease the maintenance of 
software and to solve many other problems. 


But these claims are not undisputed. For example,
Hatton [11] argues that the non-locality problems
of inheritance and polymorphism in languages like
C++ do not match the model of the human mind
well.


Encapsulation lets us think about an object in iso- 
lation; this is related to the notion of manipulating 
something in short-term memory exclusively. There- 
fore, encapsulation fits the human reasoning. Since 
in scripting languages a form of code reuse is already 
provided through reusable components, the foremost 
reason for the use of object-orientation in a scripting 
language is the encapsulation. For that reason, the 
inheritance problem also seems less conflicting, be- 
cause the inheritance is mainly used to structure the 
system and to put the components together prop- 
erly. Inheritance in scripting applications normally 
does not lead to large and complex classes that are 
strongly dependent on each other. 


Hatton [11] criticizes the polymorphism in C++ 
as damaging, because objects become more difficult
to manipulate through the evolving non-locality in
the structures. They involve a pattern-like matching
of similar behavior in long-term memory. Using the string
as a uniform and flexible interface, instead of
polymorphism, makes the objects easier to
put together. They get one unique behavior and one
unique interface. They may be used in different situ- 
ations differently, but the required knowledge about 
the object remains the same. 


These arguments account for the glueing idea of 
the scripting language in the scope of a single class 
and its environment. This scope of a “program- 
ming in the small” is the strength of current object- 
oriented (language) concepts. Their weakness is the 
“programming in the large”, where all components 
of a system have to be configured properly. The 
concepts only provide a small set of functionalities 
that work on structures larger than single classes, 
e.g. from languages like Java or C++ the following 
are known: 


• virtual properties are used to define additional
  object- and class-properties,


• abstract classes specify formal interfaces and requirements
  for a set of classes,


• parametric class definitions are used for different
  data types on one class-layout.


Besides such language constructs, methodical approaches,
like frameworks, exist. Since they are
coded using conventional language constructs the 
problems due to the language insufficiencies are not 
eliminated. The main insufficiency is that classes 
and objects are relatively small system-parts com- 
pared to an entire, complex system. Therefore, the 
wish for a language construct, which maps such a 
large structure to an instantiable entity of the pro- 
gramming language, arises. 


1.3 OTcl - MIT Object Tcl


We believe OTcl [32] is an extraordinary object-oriented
scripting language which supports several
features for handling complexity. It preserves and
extends the properties of Tcl like introspection and
dynamic extensibility. Therefore, we used OTcl as
the starting point for the development of XOTcl.


In OTcl each object is associated with a class.
Classes are ordered by the superclass relationship
in a directed acyclic graph. The root of the class
hierarchy is the class Object that contains the methods
available in all instances. A single object can
be instantiated directly from this class. In OTcl
classes are special objects with the purpose of creating
and managing other objects. Classes can be
created and destroyed dynamically like regular objects.
Classes contain a repository of instance methods
(“instprocs”) for the associated objects and provide
a superclass relationship that supports multiple
inheritance.


Since a class is a special (managing) kind of object, 
it is managed by a special class called “meta-class” 
(which manages itself). The meta-class is used to 
create and to instantiate ordinary classes. By modifying
meta-classes it is possible to change the behavior
of the derived classes widely. All inter-object and
inter-class relationships are completely dynamic and 
can be changed at arbitrary times with immediate 
effect. 


2 Design Patterns in Scripting Lan- 
guages 


The complexity of many large applications is 
caused by the combination of numerous, often in- 
dependently developed components, which have to 
work in concert. Typically, many classes are in- 
volved, with different kinds of non-trivial relation- 
ships, like inheritance, associations, and aggrega- 
tions. Design patterns provide abstractions over 
reusable designs, that can typically be found in the 
“hot spots” [28] of software architectures. Patterns 
are designed to manage complexity by merging in- 
terdependent structures into one (abstract) design 
entity. 

Design patterns are considered increasingly often as
reusable solutions for general problems. Specialized 
instances of design patterns can be used in a diver- 
sity of applications. Soukup [30] defines a pattern 
as follows: 


“A pattern describes a situation in which sev- 
eral classes cooperate on a certain task and 
form a specific organization and communica- 
tion pattern.” 


Design patterns are collected in pattern catalogs 
[10, 6]. Typically, these catalogs contain general 
patterns, but there are also catalogs which collect 
domain specific patterns. In this paper, we see a design
pattern as an abstract entity with normative,
constructive and descriptive properties, that is identified
in the design process and has to be preserved
(with documentation and usage constraints) in the
implementation as well.


2.1 Language Support for Design Pat- 
terns 


Most efforts in the literature of design patterns 
collect and catalog patterns. These activities are 
very important, since they are the basis for new soft- 
ware architectures using design patterns. Soukup 
[30] remarks that this basic work is not yet finished.


Most authors present design patterns as guidelines 
for the design. When they are used in the design 
phase, the abstract pattern has to be transformed 
into a concrete implementation for each usage. A 
basic idea of this paper is to allow one to code a 
pattern once in an abstract way (e.g. for a pattern- 
library) and reuse it later in a specialized manner. 
The gained advantage is that patterns become (ab- 
stract) entities of the design process as well as of the 
implementation. This is similar to the use of the 
design process entities “object” and “class” which 
are also entities in object-oriented programming lan- 
guages. 


There are only a few efforts in the direction of 
language support for design patterns so far. We 
believe that one reason for this lack of support is 
due to the targeted languages. Conventional object- 
oriented programming languages, like C++, offer no 
support for reproduction of larger structures than 
classes (like design patterns). Therefore, it is nearly 
impossible to get a sufficient reproduction of such 
structures as an entity¹. But there are more reasons
[4], why language support for design patterns should 
be improved: 


• Traceability: The pattern is scattered over the
  objects and, therefore, hard to locate and to
  trace in an implementation.


• Self-Problem: The implementation of several
  patterns requires forwarding of messages, e.g.
  an object A receives a message and forwards
  it to an object B. Once the message is forwarded,
  references to self refer to the delegated
  object B, rather than to the original receiver A
  (known as the self-problem [17]).


• Reusability: The implementation of the pattern
  must be recoded for every use.


• Implementation Overhead: The pattern implementation
  requires several methods with only
  trivial behavior, e.g. methods solely defined for
  message forwarding.


¹ Soukup [30] shows that some design patterns can be implemented
as classes in C++ using friend, but Bosch [4]
points out that these are only a few.
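
The self-problem and the implementation overhead can be made concrete with a small sketch. The following Python fragment is only an illustration under assumed names (Receiver, Delegate, describe and name are hypothetical and do not appear in the paper): object A forwards a message to object B through a trivial forwarding method, and inside B the reference to self denotes B, not the original receiver A.

```python
class Delegate:
    """Plays the role of object B, which finally handles the message."""

    def name(self):
        return "B"

    def describe(self):
        # `self` is the delegate here, although the message was sent to A.
        return "handled on behalf of " + self.name()


class Receiver:
    """Plays the role of object A, the original receiver."""

    def __init__(self, delegate):
        self.delegate = delegate

    def name(self):
        return "A"

    def describe(self):
        # A method with only trivial behavior: it merely forwards the message.
        return self.delegate.describe()


a = Receiver(Delegate())
result = a.describe()
print(result)  # handled on behalf of B
```

Although the message was sent to A, the answer mentions B, and Receiver.describe exists solely to forward the call.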


Pree [28] identifies seven meta-patterns that define 
most of the patterns of Gamma et al. [10]. This indicates
that it is possible to find language constructs,
which are able to represent all structures definable 
by these meta-patterns. In this work, we present the 
filter as such a language construct. 


2.2 Language Support for Design Pat- 
terns in XOTcl 


In the following sections, we describe the lan- 
guage support for design patterns we have devel- 
oped. We introduce our ideas with examples from 
the language Extended OTcl (XOTcl, pronounced
exotickle) which is an extension of OTcl, but we give
no introduction to the language. Figure 1 shows the
relationship between XOTcl and OTcl and lists important
properties of OTcl.






[Diagram: Extended OTcl
  New functionalities: dynamic aggregations, nested classes, assertions,
    meta-data, per-object mix-ins, filter
  Adopted from OTcl: object-orientation (encapsulation, inheritance,
    multiple inheritance, method chaining), meta-classes,
    read/write introspection, dynamic extensibility
  Other extensions]

Figure 1: Language Extensions of XOTcl


Tcl and OTcl already have many properties that
are very helpful for the implementation of patterns. 
Dynamic typing, as stated above, eases the man- 
agement of highly generic structures. The definition 
of pattern parts as meta-classes makes them enti- 
ties of the programming language and instantiable 
with the name of the pattern. Introspection allows 
self-awareness and adaptive programs, and simpli- 
fies the maintenance of relationships such as aggre- 
gations. Per-object specialization eases implemen- 
tation of single objects with varying behavior, e.g. 
non-specializable singleton patterns. 


In addition to the abilities of OTcl, we implemented
in XOTcl new functionality specifically targeted
at complex software architectures and patterns.
In particular, we added:


• nested objects based on Tcl’s namespaces to (a)
  reduce the interference of independently developed
  program structures, (b) support nested
  classes and (c) provide dynamic aggregations
  of objects,


• assertions to reduce interface problems, to improve
  the reliability weakened by dynamic typing
  and, therefore, to ease the combination of
  many components,


• meta-data to provide self-documentation of objects
  and classes,


• per-object mixins as a flexible means to give
  an object access to several different addition-classes,
  which may be changed dynamically, and
  finally,


• filters as a means of abstractions over method
  invocations to implement patterns (see Section
  2.3).


The first three extensions are variations of known
concepts, which we have adopted in a dynamic and
introspective fashion; the last two are novel
approaches. In [24, 33] we describe all these features
in detail; in this paper we describe only the filter
approach.


2.3 The Filter Approach 


We have pointed out that the realization of design 
patterns as entities is a valuable goal and that the 
object-oriented paradigm is not able to achieve this 
through classes alone. OTcl offers a means for the
instantiation of large structures, like entire design
patterns: the meta-classes. But in pure OTcl only
a few patterns are instantiable this way (e.g. the 
abstract factory as in [33]), without suffering from 
the problems stated in Section 2.1. Typically, these 
patterns do not rely on a delegation or aggregation 
relationship. 


Even though object-orientation orders program 
structures around the data, objects are character- 
ized primarily by their behavior. Object-oriented 
programming style encourages the access of encap- 
sulated data only through the methods of the object, 
since this allows data abstractions [31]. A method 
invocation can be interpreted as a message exchange 
between the calling and the called object. Therefore,
objects are only traceable at runtime through their
message exchanges. At this point the filters can be 
applied, which are able to catch and manipulate all 
incoming and outgoing messages of an object. 


A filter is a special instance method registered 
for a class C. Every time an object of class C 
receives a message, the filter is invoked auto- 
matically. 


A filter is implemented as an ordinary instance 
method (instproc) registered on a class. When the 
filter is registered, all messages to objects of this 
class must go through the filter, before they reach 
their destination object. The filter is free in what
it does with the message; in particular, it can (a) pass
the (potentially modified) message on to other filters
and finally to the object, (b) redirect it to
another destination, or (c) handle the message
entirely by itself.


The forward passing of messages is implemented 
as an extension of the next primitive of OTcl. next
implements method chaining without explicit nam- 
ing of the “mixin”-method. It mixes the same- 
named superclass methods into the current method 
(modeled after CLOS [5]). All classes are ordered in 
a linear next-path. At the point marked by a call to 
next the next shadowed method on this next-path is 
searched and, when it is found, it is mixed into the 
execution of the current method. 


In XOTcl, a single class may have more than one
filter. All the filters registered for a class form an 
ordered filter chain. Since every filter shadows all 
instance methods, next appears as a suitable mech- 
anism to call the next filter in the chain. When all 
filters are worked through, the actual called method 
is invoked. By placement of the next-call, a filter 
defines if and at which point the remaining filters 
(and finally the actual method-chain) are invoked. 


  Class A

  A instproc Filter-1 args {
    puts "pre-part of [self proc]"    ;# pre part
    next                              ;# next call
    puts "post-part of [self proc]"   ;# post part
  }

  A filter Filter-1
  A a1
  a1 set x 1


This introductory example defines a single class 
and a single filter instproc. It registers the filter 
for the class using the filter instance method. An 
object a1 is created. In the last line the predefined 
set method is invoked. Automatically the registered 
filter Filter-1 of class A receives the message set. 
The filter instproc consists of three (optional) parts:
the pre-part consists of the actions before the actual
method is called, the next call invokes the message
chaining, and the post-part contains the actions to
be executed before the filter is left. In this example
the pre- and post-parts are simple printing statements,
but in general they may be filled with arbitrary
XOTcl statements. The distinction between
the three parts is just a naming convention for
explanation purposes.





[Diagram labels: message (b), message (c), Filter_2, Filter_3, filter-chain]

Figure 2: Cascaded Message Filtering


The following extension of the introductory exam- 
ple shows how to apply more than one filter, which 
are cascaded through next (see Figure 2). In this 
extended example a filter chain consisting of two fil- 
ters is used. Again next forwards messages to the 
remaining filters in the chain or to the actual called 
method. The method filter registers the list of fil- 
ters to be used. 

  A instproc Filter-2 args {
    puts "only a pre-part in [self proc]"
    next
  }

  A instproc Filter-3 args {
    next
    puts "only a post-part in [self proc]"
  }

  A filter {Filter-1 Filter-2 Filter-3}


When an instance a1 of class A receives a message,
like “a1 set x 1”, it produces the following output.
The next-call in the last filter Filter-3 of the chain 
invokes the actual called method set. 

  pre-part of Filter-1
  only a pre-part in Filter-2
  only a post-part in Filter-3
  post-part of Filter-1
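
The order of this output can be checked with an analogy outside XOTcl. The following Python sketch does not show how XOTcl implements filters; the send method, the filters list and the next_call argument are hypothetical names introduced for illustration. Each filter receives a callable that plays the role of next, and the called method runs once the chain is exhausted.

```python
class A:
    # Names of the filter methods, in the order in which they are invoked.
    filters = ["filter_1", "filter_2", "filter_3"]

    def __init__(self):
        self.trace = []  # collects what the filters would print

    def set(self, name, value):
        setattr(self, name, value)
        return value

    def send(self, message, *args):
        chain = [getattr(self, name) for name in self.filters]

        def call(index):
            if index == len(chain):
                # All filters are worked through: invoke the called method.
                return getattr(self, message)(*args)
            # Hand the filter a callable standing in for `next`.
            return chain[index](lambda: call(index + 1))

        return call(0)

    def filter_1(self, next_call):
        self.trace.append("pre-part of Filter-1")
        result = next_call()
        self.trace.append("post-part of Filter-1")
        return result

    def filter_2(self, next_call):
        self.trace.append("only a pre-part in Filter-2")
        return next_call()

    def filter_3(self, next_call):
        result = next_call()
        self.trace.append("only a post-part in Filter-3")
        return result


a1 = A()
a1.send("set", "x", 1)
print("\n".join(a1.trace))
```

The collected trace has the same four lines in the same order as the output above, because each post-part runs only after the rest of the chain has returned.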


The filter method can be used to remove filters 
dynamically as well. E.g. the filters Filter-1 and 
Filter-3 can be removed by: 


A filter Filter-2 


On each class the filters are invoked in the order
specified by the filter instance method. To avoid
circularities, all filters which are currently active
(that is, the current call is invoked directly
or indirectly from a filter instproc) are temporarily
left out of the filter chain. Filter chains can also be
combined through (multiple) inheritance using next.
Since filters are normal instprocs they may themselves
be specialized through inheritance. When the
end of the filter chain of the object’s class is reached,
the filter chains of the super-classes are invoked using
the same precedence order as for inheritance.


[Diagram labels: pre-part, Filter Chain of Class B, b1 set x 10, post-part]

Figure 3: Filter Inheritance


This is demonstrated by the example displayed in 
Figure 3. B is a subclass of A with two instances b1 
and b2. Both instances are filtered with the chains 
registered on B, A and Object. The invocation b1 set
x 10 results in the next-path shown in Figure 3.


  Class B -superclass A

  B instproc Filter-B args {
    puts "entering method: [self proc]"
    next
  }

  B b1; B b2
  B filter Filter-B
  b1 set x 10
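
The combination of filter chains along the class hierarchy can be sketched in the same hedged way. In the following Python analogy (FilteredObject, send and the per-class filters list are assumptions of this sketch, not XOTcl features), the chain is assembled from the object's class first and then from its superclasses in Python's method resolution order. The temporary exclusion of active filters is not modeled.

```python
class FilteredObject:
    """Hypothetical root class, standing in for XOTcl's Object."""

    filters = []  # each class lists the names of its own filter methods

    def __init__(self):
        self.trace = []

    def set(self, name, value):
        setattr(self, name, value)
        return value

    def send(self, message, *args):
        # Chain of the object's class first, then the chains of the
        # superclasses, following the method resolution order.
        chain = []
        for cls in type(self).__mro__:
            for filter_name in vars(cls).get("filters", []):
                chain.append(getattr(self, filter_name))

        def call(index):
            if index == len(chain):
                return getattr(self, message)(*args)  # the called method
            return chain[index](message, lambda: call(index + 1))

        return call(0)


class A(FilteredObject):
    filters = ["filter_a"]

    def filter_a(self, message, next_call):
        self.trace.append("filter_a sees " + message)
        return next_call()


class B(A):
    filters = ["filter_b"]

    def filter_b(self, message, next_call):
        self.trace.append("filter_b sees " + message)
        return next_call()


b1 = B()
b1.send("set", "x", 10)
print(b1.trace)
```

The filter registered on B runs before the one inherited from A, which mirrors the precedence order described for Figure 3.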


Filters have rich introspection mechanisms. Each
class may be queried (using the introspection
method info filters) for the filters that are currently
installed. A filter method can obtain information
about itself and its environment, and also about
the calling and the called method. Examples are
the name of the calling and the called method, the
class where the filter is registered, etc. (see [24] for
details). By using these introspection mechanisms
filters can exploit various criteria in order to decide 
how to handle a message. 


Often it is useful to add filters to an existing chain 
of filters. This can be achieved conveniently by the 
instproc filterappend defined for the top-level class 
Object. Therefore this method is inherited by all 
classes. 


  Object instproc filterappend f {
    [self] filter [concat [[self] info filters] $f]
  }
  A filterappend {Filter-2 Filter-3}


5th USENIX Conference on Object-Oriented Technologies and Systems (COOTS '99) 


3 Language Support for Design Pat- 
terns using Filters 


We now present a systematic approach to implementing
design patterns with filters. In general,
filters are very flexible and well-suited for implementing
patterns in various creative ways.


3.1 Applying Filters on Meta-Patterns 


In [28] Pree has identified meta-patterns as structures
underlying several design patterns. They subdivide
the pattern into a general, generic pattern-class
(called template) and a class which serves as
an anchor for the application specific details (called 
hook). 



















[Diagram labels: Template Class, Hook Class, Specialized Hook Class]

Figure 4: The 1:1 Meta-Pattern [28] with a Specialized Hook


Figure 4 shows a simple meta-pattern, that is 
based on a 1:1 association. It separates the spe- 
cializations of a hook class from a template class. 
This is just an example pattern to give an idea of 
hook and template. It is obvious that many object- 
oriented structures, like several design patterns in 
[10, 6], are based upon this meta-structure. 


The methods of the template implement the 
generic part of the structure and invoke the hook 
methods. The abstract hook forms a common in- 
terface for its specializations. The structure can be 
reused with different special hooks without changes 
to the template. 


Filters are well-suited to implement meta- 
patterns. By using a filter all activities of a pat- 
tern can be treated in one entity (the filter instance 
method). Since all messages are directed to the fil- 
ter the abstract tasks of the pattern can be sepa- 
rated from actual tasks of the application. But this 
alone would not be a reusable solution, since for ev- 
ery template class of every task, where the pattern 
could be used, a new filter method would have to be 
implemented. In order to achieve reusability we use 
a meta-class that provides the desired functionality. 
This meta-class may be stored in a library and can 
be reused every time a similar problem occurs. 


USENIX Association

The steps to obtain a reusable and instantiable
pattern based on filters from a pattern class diagram
(e.g. Figure 4) are:


1. Find the hook and template classes. 


2. Create a meta-class under the general name of 
the pattern. 


3. Add a filter method to the meta-class, which 
performs all recurring tasks desired from the 
design pattern (especially the tasks of the tem- 
plate). 


4. Add a constructor to the meta-class, which reg- 
isters the filter on classes derived from the meta- 
class (and performs pattern specific initializa- 
tion tasks). 


5. Add additional methods to the meta-class (e.g. 
like registration of special hooks) to avoid hard- 
coding of pattern semantics in the filter method. 


With slight adaptations this scheme is applicable
to all patterns that rely on Pree’s meta-patterns
(e.g. most of the patterns in [10]). Most other
patterns can be supported by filters as well, since they
normally involve message exchanges. A
meta-class can be defined as a general solution for 
a large number of related problems. In order to use 
it, the application must derive a class from it (e.g. 
with the name of the template class) and concretize 
the application specific actions (that means the hook 
classes). 
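
The five steps can be pictured with a Python analogy before they are applied in XOTcl. The sketch below is an assumption-laden illustration, not the XOTcl mechanism: OneToOneMetaPattern, pattern_filter, set_hook and the explicit send method are hypothetical names, and Python's metaclasses only approximate XOTcl meta-classes.

```python
def _send(self, message, *args):
    # Every message sent with send() is handed to the filter of the meta-class.
    return type(self).pattern_filter(self, message, args)


class OneToOneMetaPattern(type):
    """Step 2: a meta-class carrying the general name of the pattern."""

    def __init__(cls, name, bases, namespace):
        # Step 4: the constructor prepares each class created from the
        # meta-class and routes its messages through the filter.
        super().__init__(name, bases, namespace)
        cls.hook = None
        cls.send = _send

    def pattern_filter(cls, obj, message, args):
        # Step 3: the recurring task of the template, here forwarding a
        # message to the special hook if the hook understands it.
        if cls.hook is not None and hasattr(cls.hook, message):
            return getattr(cls.hook, message)(*args)
        return getattr(obj, message)(*args)  # otherwise behave like `next`

    def set_hook(cls, hook):
        # Step 5: an additional method, so the hook is not hard-coded.
        cls.hook = hook


class FilteredTemplate(metaclass=OneToOneMetaPattern):
    def operation(self):
        return "template default"

    def other(self):
        return "template only"


class SpecialHook:
    def operation(self):
        return "special hook"


FilteredTemplate.set_hook(SpecialHook())
template = FilteredTemplate()
results = [template.send("operation"), template.send("other")]
print(results)
```

Step 1, finding the hook and template classes, is a design decision and therefore appears only in the choice of FilteredTemplate and SpecialHook.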


We now show, using a template for the 1:1 meta-pattern,
how to apply the scheme in XOTcl. The
first step is to define a meta-class. In XOTcl a meta-class
is defined by referencing the meta-class Class
as superclass of a newly defined class:


Class 1-1-Meta-Pattern -superclass Class 


Secondly, a filter instproc must be defined: 


  1-1-Meta-Pattern instproc 1-1-Filter args {
    # filter actions
    # e.g. forwarding messages to the special hook
  }


As the next step the constructor init registers 
the filter on the newly created class and performs 
other initialization tasks, like variable initialization, 
method declaration, etc.:

  1-1-Meta-Pattern instproc init args {
    # initialization tasks
    [self] filterappend 1-1-Filter
  }


For real applications the meta-class has to be ex- 
tended with additional methods. In order to com- 
plete the implementation of the 1:1 meta-pattern a 
method, which stores a reference to the special hook 
on the object, has to be defined. Finally, the meta- 
pattern is instantiated to create a filtered template 
class. 


1-1-Meta-Pattern FilteredTemplate 


In order to provide the hook for the filter a special 
hook class and perhaps concretizations have to be 
created. 


The presented scheme may be extended for more 
specialized patterns, e.g. a recursive pattern may re- 
quire recursive registering of the filter. Sometimes it 
is useful (but not necessary) to apply a second filter 
(e.g. in patterns with a second referencing relation- 
ship, like mediator or observer in [10]). 


3.2 Design Pattern Examples 


The idea underlying meta-patterns splits patterns 
into two parts: the template and the hook. We have 
shown a scheme how to apply a filter if this division 
is possible. This section applies the scheme in order 
to implement three example patterns from [10]. 


3.2.1 The Adapter Pattern 


The adapter pattern [10] converts the interface of 
a class into another interface that a client expects. 
Therefore, an adapter is a means to let classes cooperate
despite incompatible interfaces. As shown
in Figure 5 the conventional solution is to forward 
the messages from Adapter to Adaptee by explicit 
calls. This approach entails that for every adapted 
method a new additional method must be defined 
in the adapter. This leads to an implementation 
overhead. Moreover, the solution’s program code is 
neither reusable nor traceable. 


[Diagram labels: Adapter (Request), Adaptee (SpecificRequest)]

Figure 5: The Adapter Pattern [10]


The solution for the adapter problem presented 
below is based on filters and avoids these problems.
It is reusable and does not require the implementation
overhead resulting from methods which are defined
solely for the purpose of message forwarding.
The forwarding is handled automatically by the next 
primitive in the filter method, no additional helper 
methods are needed. 


By following the systematic steps presented above, 
we identify the template (here Adapter) and the hook 
(here Adaptee). The adapter pattern resembles the 
1:1 meta-pattern of Section 3.1, but it has no spe- 
cial hooks. The desired actions of the template are 
to forward requests to specific requests. This will be 
handled by the filter. Firstly, we define a meta-class 
which replaces the pattern from the conventional de- 
sign in Figure 5. 


Class Adapter -superclass Class 


A meta-class can be used to derive new classes that 
can access the instance methods of the meta-class. 
The derived classes are constructed with the constructor
of the meta-class. Next, we define the filter instance
method:

  Adapter instproc adapterFilter args {
    set r [[self] info calledproc]
    [self] instvar specificRequest adaptee \
        [list specificRequest($r) sr]
    if {[info exists sr]} {
      return [eval $adaptee $sr $args]
    }
    next
  }


The info calledproc command returns the origi- 
nally called method. This is the general request 
which is to be mapped to a specific request. The two 
variables specificRequest and adaptee are instance 
variables which are linked to the current scope by 
the primitive method instvar. The specificRequest 
for the called method is mapped to the variable sr. 
adaptee is the object which handles the specific re- 
quests. If there exists a mapping of the current re- 
quest, the filter forwards the message to the asso- 
ciated method. Otherwise the message is not redi- 
rected, but passed further on by the filter along the 
next-path. 
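
The redirect-or-pass-on logic of the filter can be imitated in Python as a rough analogy. The sketch below assumes hypothetical names (Target, FilteredAdapter, request_map, set_request) and uses Python's __getattribute__ hook, which sees every attribute lookup on the object. It is the closest standard counterpart to a filter here, not an equivalent of the XOTcl mechanism.

```python
class Adaptee:
    def specific_request(self, argument):
        return "specific request with " + argument


class Target:
    def request(self, argument):
        return "unadapted request with " + argument

    def other(self):
        return "passed along unchanged"


class FilteredAdapter(Target):
    def __init__(self, adaptee):
        self.adaptee = adaptee
        self.request_map = {}  # general request -> specific request

    def set_request(self, request, specific):
        self.request_map[request] = specific

    def __getattribute__(self, name):
        # Read the instance state without triggering this hook again.
        state = object.__getattribute__(self, "__dict__")
        specific = state.get("request_map", {}).get(name)
        if specific is not None:
            # A mapping exists: redirect the request to the adaptee.
            return getattr(state["adaptee"], specific)
        # No mapping: continue with the normal lookup (the role of `next`).
        return object.__getattribute__(self, name)


adapter = FilteredAdapter(Adaptee())
adapter.set_request("request", "specific_request")
redirected = adapter.request("x")
unchanged = adapter.other()
print(redirected)
print(unchanged)
```

As in the filter above, only registered requests are redirected, and no forwarding method has to be written per adapted request.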


As the next step we have to define the constructor 
which adds the filter to the class. In order to be able 
to set the specificRequest and adaptee variables it 
is convenient to define instprocs for this purpose, 
which are defined for the derived classes. These in- 
stprocs are created dynamically by the constructor 
(the init instproc) of the meta-class: 

  Adapter instproc init args {
    [self] filterappend [self class]::adapterFilter
    next
    [self] instproc setRequest {r sr} {
      [self] set specificRequest($r) $sr
    }
    [self] instproc setAdaptee {a} {
      [self] set adaptee $a
    }
  }


Figure 6: The Adapter Pattern Using Filters


Now the abstract pattern is converted into a meta- 
class, which can be used to derive classes with the 
behavior of the pattern: method invocations, which 
correspond to registered requests, are redirected to 
the adaptee object; all other invocations are passed 
unmodified to the object through the next-path (see 
Figure 6). 


The solution in [10] suffers from the self-problem, 
since the originally called object of the adapter class 
is not the object which performs the desired task. 
This problem is not addressed by the filter solu- 
tion presented above. A more sophisticated solu- 
tion, which does not suffer from the self-problem, 
is to define the filter on the adaptee instead of the 
adapter. For the sake of simplicity we presented here 
the slightly simpler version. 


A sample application of this pattern is a class
which handles network connections. Derived classes,
like FTP, HTTP, etc., allow one to handle specialized
connections. All of them must implement a method
connect. A method discard of the Connection class is
able to close connections of all kinds. Suppose an
FTP connection routine from a library class with a
different interface should be used. A filter adapter
based on the defined meta-class can solve this
problem elegantly. First, the interfaces of the
related classes:

# interface of the library class
Class FTPLIB
FTPLIB instproc FTPLIB_connect args {...}

# the connection class
Class Connection

# an abstract connection method
Connection instproc connect args {...}

# the method to close a network connection
Connection instproc discard args {...}

# ... other class definitions, like HTTP


Now we derive a class FTP from the Adapter. The
meta-class's constructor defines the two convenience
methods and registers the filter on the new class
FTP automatically. Strictly speaking the convenience
methods are not necessary, but they provide a simpler
interface. The class FTP has a constructor that
automatically creates an associated adaptee and
provides the needed information for the filter
through the convenience methods.


Adapter FTP -superclass Connection

FTP instproc init args {
  FTPLIB ftpAdaptee
  [self] setRequest connect FTPLIB_connect
  [self] setAdaptee ftpAdaptee
}


Finally, the FTP class can be used and is adapted 
automatically. Since only the method connect was 
a registered request, all discard-calls reach the 
Connection class. 


FTP ftp1
ftp1 connect
ftp1 discard


This simple example can be extended with only a
few more lines of code to provide more sophisticated
adaptations (e.g. altering parameters, adapting to
other objects, etc.) without architectural redesign.


3.2.2 The Composite Pattern 


A recursive pattern from [10] is the composite
pattern, shown in Figure 7. The composite pattern
helps to arrange objects in hierarchies with a unique
interface type, called component. The objects are
arranged in trees with two kinds of components:
leaves and composites. Every composite can hold
other components.


[Figure: class diagram with the operations Operation, Add, Remove and GetChild; annotation on Operation: forall g in children g.Operation();]

Figure 7: The Composite Pattern [10]


There are several disadvantages in the implementation
of the pattern in [10]. The composite pattern
structure contains dynamic object aggregation, which
is not provided in C++. Therefore, the implementation
lacks flexible mechanisms to handle and introspect
the aggregation. An implementation overhead results
from the necessity to define methods for the
management of the dynamic aggregation. Furthermore,
the scattering of the pattern across several classes
leads to a mixing of application and pattern
structures that reduces reusability.


The pattern (as presented in [10]) is not an
abstract entity; therefore, it is hard to specialize
and to reuse. Also, it is not easy to find in source
code if it is not well commented, and both its
description in the pattern classes and the runtime
object structure are hard to introspect and to trace.


In order to implement the pattern as a filter, we 
firstly identify its elements. The composite class 
forms the template, the component class the hook. 
The desired action of the template is to forward all 
messages to the aggregated objects recursively. The 
application specific actions are the concretizations 
that determine what these classes do with the mes- 
sages. We create a meta-class: 


Class Composite -superclass Class 
Composite instproc addOperations args {...} 
Composite instproc removeOperations args {...} 


As a useful enhancement to the solution in [10], new
operations are added and removed by addOperations
and removeOperations (not to be confused with the
methods for aggregation handling in Figure 7). Only
registered operations will be forwarded to the
objects in the composite pattern's runtime structure.


All generic pattern tasks will be performed by a 
filter. It handles the forwarding to the components 
of a composite: 


Composite instproc compositeFilter args {
  [[self] info class] instvar operations
  set r [[self] info calledproc]
  if {[info exists operations($r)]} {
    foreach object [[self] info children] {
      eval [self]::$object $r $args
    }
  }
  return [next]
}

In the composite filter, the request is first
compared to the operations in the operations list. If
the request is a registered operation, the message
is forwarded to each child. Since children may be
composites, this mechanism works recursively on
the entire structure until the leaves are reached.
Afterwards, the filter passes the message on to the
called object itself through next.
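The recursive forwarding can be illustrated with a small Python analogue. This is a sketch under assumptions rather than the XOTcl mechanism: the lookup hook `__getattribute__` stands in for the filter, a class attribute `operations` stands in for the registered operations, and the names `CompositeFilter`, `children` and `log` are hypothetical.

```python
class CompositeFilter:
    """Mixin whose lookup hook forwards registered operations to children."""

    operations = frozenset()   # filled per class, like addOperations

    def __getattribute__(self, name):
        target = object.__getattribute__(self, name)
        if name not in type(self).operations:
            return target                     # unregistered: no forwarding

        def forward(*args):
            for child in object.__getattribute__(self, "children"):
                getattr(child, name)(*args)   # composite children recurse here
            return target(*args)              # then the "next" call on self
        return forward


class Graphic:
    def __init__(self, name, log):
        self.name, self.log = name, log

    def draw(self):
        self.log.append(self.name)


class Picture(CompositeFilter, Graphic):
    def __init__(self, name, log):
        super().__init__(name, log)
        self.children = []


class Line(Graphic):
    pass


Picture.operations = frozenset({"draw"})   # cf. "Picture addOperations draw"

log = []
a_picture, b_picture = Picture("aPicture", log), Picture("bPicture", log)
a_picture.children += [b_picture, Line("aLine", log)]
b_picture.children.append(Line("bLine", log))
a_picture.draw()
print(log)   # children are drawn before the composite that holds them
```

The leaf classes carry no pattern code at all; only the composite class mixes in the forwarding behavior.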


In order to register the filter on a new composite 
class automatically, we append it in the constructor 
of the meta-class: 


Composite instproc init {args} {
  next
  [self] filterappend Composite::compositeFilter
}


Now we will show with an illustrative example that
this single method handles all semantics of the
pattern. As a sample application we will build up a
simple graphic:

Class Graphic
Graphic instproc draw {} {...}


Different graphics objects can be defined on the
basis of the component type. For example, we can
define a Composite (Picture) with two leaves (Line
and Rectangle):

Composite Picture -superclass Graphic 


Class Line -superclass Graphic 
Class Rectangle -superclass Graphic 





Figure 8: The Composite Object Structure 


The graphic structure shown in Figure 8 can be con- 
structed by: 

Picture aPicture
Picture aPicture::bPicture
Line aPicture::aLine
Rectangle aPicture::aRect
Line aPicture::bPicture::aLine
Rectangle aPicture::bPicture::aRect
Picture aPicture::bPicture::cPicture
Picture aPicture::bPicture::dPicture
Line aPicture::bPicture::cPicture::cLine


An invocation of addOperations on the composite
class, like:


Picture addOperations draw 


registers the draw message for all the component ob- 
jects in the structure. A call of draw draws the whole 
hierarchy: 


aPicture draw 


Note how simple and short it was to instantiate
the sample application. Besides eliminating the
problems mentioned above, the filter solution is much
shorter and easier to understand than a solution of
the picture application following [10]. It avoids
complex structures that are connected in many ways,
and removes the need for replicated code, since it
takes the pattern semantics completely out of the
application. Furthermore, the pattern becomes
reusable as a program fragment (and may be put into
a library of patterns) and not only as a design
entity, which has to be recoded for every usage.

To map the recursive structure of the pattern,
a more general solution, in which each composite
class recursively gets its own filter instproc, is
easily achievable. This would allow one to specialize
the filter's behavior for certain branches of the
structure (e.g. in order to fade out parts of the
picture) or to store different composites, components
or other classes in the pattern structure (which is
possible to a certain degree in the solution
presented).


3.2.3 The Observer Pattern 


The observer pattern presented in this section fulfills
the task of informing a set of dependent objects
("observers") of state changes in one or more observed
objects ("subjects"). This problem is well known
and often addressed, e.g. by the publisher-subscriber
pattern [6] or Model-View-Controller [14]. Figure 9
shows the observer design pattern as presented in
[10].










[Figure: class diagram with ConcreteObserver and a subject offering GetState and SetState on subjectState; annotations: for all o in observers { o->Update() } and observerState = subject->GetState()]

Figure 9: The Observer Pattern [10]


Bosch [3] identifies the problem that the
traceability of the pattern suffers from the fact that
the methods attach, detach and notify do not build up
a conceptual entity and that the calls of notify must
be inserted at every point where a state change
occurs. The reusability of the concrete subjects also
suffers from these problems. A filter which directs
all state changes of the subject to the observers does
not have these problems and provides a reusable
solution.


In order to implement an observer pattern based
on filters we create meta-classes for the observer and
the subjects. The subjects are structured as a nested
class to preserve the unity of the pattern:


Class Observer -superclass Class
Class Observer::Subject -superclass Class


In this example we only handle the relationship
between subject (as template) and observer (as hook)
by a filter. In a more sophisticated solution the
second referencing relationship between concrete
observer and concrete subject may also be replaced by
a filter. Now we can define a filter which handles the
notification:

Observer::Subject instproc notifyFilter args {
  set r [[self] info calledproc]
  [self] instvar preObservers postObservers \
    [list preObservers($r) preObs] \
    [list postObservers($r) postObs]
  if {[info exists preObs]} {
    foreach o $preObs {$o update [self] $args}
  }
  set result [next]
  if {[info exists postObs]} {
    foreach o $postObs {$o update [self] $args}
  }
  return $result
}


Observers are registered with the attach and detach 
methods. As a special feature we allow both pre- 
and post-observers to be registered. When the filter 
method is invoked, firstly all registered pre-observers 
are informed, then the actual method is invoked and 
then all post-observers are informed. Finally, the 
filter returns the result of the called method. 
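The order of events (pre-observers, actual method, post-observers, result) can be checked with a short Python analogue. It is a sketch under assumptions and not the XOTcl filter: `__getattribute__` intercepts the lookup of the called method, and the names `NotifyFilter`, `attach_pre`, `attach_post`, `Collector` and `TextOutput` are hypothetical stand-ins.

```python
class NotifyFilter:
    """Mixin that wraps observed methods with pre- and post-notification."""

    def attach_pre(self, method, *observers):
        table = self.__dict__.setdefault("pre_observers", {})
        table.setdefault(method, []).extend(observers)

    def attach_post(self, method, *observers):
        table = self.__dict__.setdefault("post_observers", {})
        table.setdefault(method, []).extend(observers)

    def __getattribute__(self, name):
        target = object.__getattribute__(self, name)
        state = object.__getattribute__(self, "__dict__")
        pre = state.get("pre_observers", {}).get(name)
        post = state.get("post_observers", {}).get(name)
        if pre is None and post is None:
            return target                  # nothing registered for this call

        def notify(*args):
            for o in pre or []:
                o.update(self, *args)      # inform pre-observers first
            result = target(*args)         # the actual method ("next")
            for o in post or []:
                o.update(self, *args)      # then inform post-observers
            return result
        return notify


class Collector(NotifyFilter):
    def __init__(self):
        self.response = None

    def ping(self, line):
        self.response = line
        return len(line)

    def get_response(self):
        return self.response


class TextOutput:
    def __init__(self):
        self.seen = []

    def update(self, subject, *args):
        self.seen.append(subject.get_response())


c1 = Collector()
pre, post = TextOutput(), TextOutput()
c1.attach_pre("ping", pre)
c1.attach_post("ping", post)
result = c1.ping("64 bytes")
print(pre.seen, post.seen, result)
```

The pre-observer still sees the old state of the subject, while the post-observer sees the state after the observed method has run.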


The trivial methods to register or unregister
observers (here: attachPre, attachPost, detachPre and
detachPost) are created by the constructor init on
all instantiated classes, so that their objects can
reach them as instprocs (not presented here).


[Figure: diagram relating Observers and Subjects]

Figure 10: Observer Example


We demonstrate the usage of the abstract observer
pattern by an example of a network monitor which
observes a set of connections and maintains several
views on these (e.g. a diagram and a textual
output). In the implementation the class Pinger
encapsulates the view and collector classes; the
collectors are treated as subjects of the observer:

Class Pinger

Observer::Subject Pinger::Collector

Observer Pinger::Diagram
Observer Pinger::TextOutput


The Collector starts the observation of the network 
connection in its constructor, e.g.: 


Pinger::Collector instproc init args {
  set hostName 132.252.180.67
  set f [open "| /bin/ping $hostName" r]
  fconfigure $f -blocking false
  fileevent $f readable "[self] ping \[gets $f\]"
}


The operation ping is the network event, which must 
be handled by the collector. Since the collector is 
a concrete subject it needs a method (getResponse) 
which is invoked by the observers to get its current 
state: 


Pinger::Collector instproc ping {string} {...}
Pinger::Collector instproc getResponse {} {...}


The two observers must concretize their update
methods. Both must fetch the current state of the
subject using getResponse and then update their
presentation. The text output presentation may look
like:


Pinger::TextOutput instproc update {subject args} {
  set response [$subject getResponse]
  puts "PINGER: $subject --- $response"
}


For concrete applications the classes must be instan- 
tiated. Here are two collectors, some observers and 
some attachments: 


Pinger::Collector c1
Pinger::Collector c2
Pinger::Diagram d1
Pinger::Diagram d2
Pinger::TextOutput t1

c1 attachPre ping d1 d2
c1 attachPost ping d2 t1
c2 attachPost ping t1 d2

This attaches the diagrams and the text output to
the collectors c1 and c2 as pre- and as post-observers,
as shown in Figure 10.


4 Related work 


There are many other concepts with names con- 
taining the word “filter” (e.g. in the area of mo- 
bile/distributed computing [27, 16]). The composi- 
tion filter model [1] introduces the idea of a higher- 
level object interaction model through abstract com- 
munication types (ACTs). Besides such basic ideas 
of a means to change, redirect, or otherwise affect 
messages, we have not found an approach with
properties comparable to those of filters (user-defined
methods, mixin of filter chains, inheritance, etc.).
Nevertheless, in Section 4.3, we describe other approaches
providing language support for design patterns. 

4.1 Meta-Object-Protocol


One of the most flexible environments for
object-oriented engineering is the CLOS environment
with its meta-object-protocol [13]. We are convinced
that filters can be implemented in this environment,
which provides many hooks to influence the behavior
and semantics of objects. Our filter approach differs
significantly, since filters provide a high level
construct, which is tailored to monitor and to modify
object interactions.


One example in [13] enhances CLOS with encapsulated
methods capable of restricting the access of
private variables to methods of their class. The
system's method, used to apply methods, is enhanced
with a sub-protocol which can add a set of function
bindings to the method body's lexical environment.
The filter would have been a shorter and higher level
solution for this problem, because it does not require
modifications or additions to the underlying system's
behavior. Therefore it does not require knowledge
about the system's structure, like how the system
applies methods or how lexical definitions are bound
to methods. Moreover, the filter solution can be
scaled more easily, since filters may be dynamically
registered and unregistered.


4.2 Meta-Programming 


From the abstraction point of view filters are
closely related to the area of meta-programming,
which has been studied in the area of Lisp-like
languages (e.g. [2]) or in the area of logic languages,
as sketched in this section.


The filter approach is a very general mechanism
which can be used, besides language support for
design patterns, in various other application areas.
We see object-orientation and filters as an analogy to
the interpretation layer introduced by meta-programs
which are used to interpret existing programs in a
new context with additional functionality [20, 21].
In [22] the abstraction introduced by layered
interpreters is called interpretational abstraction. The
basic idea of interpretational abstraction is to treat
program instructions of one program (source program)
as data of another program (a meta-program,
a compiler or interpreter) that reasons about the
instructions of the source program. During this
reasoning process new functionality can be introduced
into the source program by interpreting the goals of
the source program in a new context. Instead of
altering the application program (the knowledge
representation), an additional interpretation layer can
be introduced to change the behavior in certain
situations. This way interpreters can be used as a
programming device. The inefficiency of the reasoning
process can be eliminated by techniques like partial
evaluation [8] or interpreter directed compilation
[21].
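The idea of an additional interpretation layer can be made concrete with a toy example. The following Python sketch is an illustration under assumptions and is not taken from [20, 21, 22]: the instruction format and the names `run` and `traced` are hypothetical. The source program stays unchanged, while a layer that treats its instructions as data adds tracing.

```python
def run(program, env):
    """Base interpreter: executes (target, op, operand) instructions."""
    for target, op, operand in program:
        if op == "set":
            env[target] = operand
        elif op == "add":
            env[target] += operand
    return env


def traced(interpreter, trace):
    """Extra interpretation layer: same program, additional behavior."""
    def layered(program, env):
        for instruction in program:
            trace.append(instruction)        # instructions handled as data
            interpreter([instruction], env)  # delegate to the layer below
        return env
    return layered


program = [("x", "set", 1), ("x", "add", 41)]
trace = []
env = traced(run, trace)(program, {})
print(env, trace)
```

The per-instruction delegation also shows where the overhead mentioned above comes from.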


The filter approach is an introduction of
meta-programming ideas into object-orientation. Even
if the filter never accesses the real program (which a
filter in XOTcl could do through the provided
introspection mechanisms), it has full and unlimited
access to the most important thing in object-oriented
runtime structures: the messages. A filter handles
messages of objects as data which can be processed in
arbitrary ways (modified, redirected, handled). The
filters are able to reinterpret messages freely; the
filter methods are "interpreters" for messages and can
influence all object communication.


In general the application domain of filters is very
wide. For example, assertions and meta-data as
presented in [24] could have been implemented using
filters. The only argument against this was that
the implementation in C is much faster, since filters
in the current implementation reduce execution speed.
However, it would be interesting to investigate to
what degree compilation methods like those described
above could eliminate the overhead.


4.3 Other Approaches for Supporting 
Design Patterns 


As stated above, Soukup [30] has identified prob- 
lems in the implementation of popular design pat- 
terns [10] and has shown that some patterns could 
be implemented as classes. 


The LayOM-approach [3] is the most similar to 
the filter approach. It offers an explicit represen- 
tation of design patterns using an extended object- 
oriented language. The approach is centered on mes- 
sage exchanges as well and puts layers around the 
objects which handle the incoming messages. Every 
layer offers an interface for the programmer to de- 
termine the behavior of the layer through a set of 
operators which are (statically) given by the layer 
definition. LayOM is a compiled language with a 
static class concept and can be translated into C++. 
The model is statically extensible with new layers. 


The filter approach differs from LayOM since
it can represent design patterns as normal classes
and needs no new constructs, only regular methods.
Therefore, the filter approach is closer to the
object-oriented paradigm. Furthermore, the filters
can be dynamically reconfigured (added, removed,
etc.) and are able to exploit introspection provided
by the underlying language.


The FLO-language [7] introduces a new compo- 
nent “connector” that is placed between interacting 
objects. The connectors are controlled through a set 
of interaction rules that are realized by operators 
(not normal methods). This connector-approach 
also concentrates on the messages of the objects but 
introduces the connectors as new entities. FLO is 
open for change (using a meta-object-protocol) and 
because the connectors are represented as objects it 
is close to the object-oriented paradigm. 


The introduced operators are not object-oriented 
by nature and, therefore, less intuitive in an object- 
oriented system than method invocation. The ap- 
proach of FLO involves a more complicated design, 
because in addition to the design patterns, connec- 
tor objects have to be defined. The filter approach 
can avoid this problem by the automatic registration 
of filters on the involved classes. 


Both mentioned approaches do not seem to offer
the same ease as the filter in specializing an abstract
pattern (like in [10]) to a concrete, more domain
specific pattern (in the sense of [30]). Where the
filters can simply use inheritance, both approaches
need the definition of a new domain specific layer or
connector.


Hedin [12] presents an approach based on an at- 
tribute grammar in a special comment marking the 
pattern in the source code. This addresses the prob- 
lem of traceability. The comments assign roles to the 
classes, which constrain them by rules like “A DEC- 
ORATOR must be a subclass of COMPONENT”. 
The system can test automatically (in the source 
code) if the realized pattern satisfies the given and 
derived constraint rules. 


This approach is not based on message exchanges
(and is, therefore, rather simplistic), but it may be
applied in any object-oriented language. It is only
descriptive and not constructive (and, therefore, not
reusable); each pattern must be commented again
if it is applied to a new application. The ability to
assign constraints to patterns is interesting,
especially because XOTcl provides similar abilities as
well. The assertions can constrain classes (and
objects) formally and informally. Both the consistency
of the pattern class and its instances can be checked
at run-time.


5 Conclusion 


The intention of this paper is to show that object- 
oriented scripting languages and the management 
of complexity are not contradictory and that it is 
possible to handle complexity with a different set of 
advantages and tradeoffs than in “systems program- 
ming languages”. Scripting is based upon several 
principles of programming, like using dynamic typ- 
ing, flexible glueing of preexisting components, using 
component frameworks etc., that can lead towards 
a higher productivity and software reuse. We have 
introduced a new language construct, the filter, that 
offers a powerful means for the management of com- 
plex systems in a dynamic and introspective fashion. 
It would have been substantially more difficult to im- 
plement dynamic and introspective filters in a sys- 
tems programming language. We believe that both 
scripting and object-orientation offer extremely use- 
ful concepts for a certain set of applications and that 
our approach is a useful and natural way to combine 
them properly. 


XOTcl is available for evaluation from:
http://nestroy.wi-inf.uni-essen.de/xotcl/

Abstract


The performance of CORBA (Common Object Re- 
quest Broker Architecture) objects is greatly influenced 
by the application context and by the performance of the 
ORB endsystem, which consists of the middleware, the 
operating system and the underlying network. Applica- 
tion developers need to evaluate how candidate applica- 
tion object architectures will perform within heteroge- 
nous computing environments, but a lack of standard 
and user extendable performance benchmark suites ex- 
ercising all aspects of the ORB endsystem under realis- 
tic application scenarios makes this difficult. This paper 
introduces the Performance Pattern Language and the 
Performance Measurement Object which address these 
problems by providing an automated script based frame- 
work within which extensive ORB endsystem perfor- 
mance benchmarks may be efficiently described and au- 
tomatically executed. 


1 Introduction 


The Common Object Request Broker Architecture 
(CORBA)[15] is emerging as an important open stan- 
dard for distributed-object computing, especially in 
heterogenous computing environments combining mul- 
tiple platforms, networks, applications, and legacy 
systems[27]. Although the CORBA specifications de- 
fine the features of a compliant ORB, they do not spec- 
ify how the standards are to be implemented. As a re- 
sult, the performance of a given application supported 
by ORBs from different vendors can differ greatly, as 
can the performance of different applications supported 
by the same ORB. 

A number of efforts have been made to measure the
performance of ORBs, often comparing with performance
of other ORBs [23]. These efforts generally measure
only specific aspects of ORB performance in isolation.
While performance of specific ORB functions is
important, it is also important to realize that superior
results in a few simple tests do not ensure that the
aggregate performance of ORB A is better than ORB B for
a particular application object architecture. The
performance of ORB based applications implemented as a
set of objects is greatly influenced by the application
context and by the architecture and performance of the
ORB endsystem. The endsystem consists of the ORB
middleware, the operating system and the underlying
network. An application's performance is determined by
how well these components cooperate to meet the
particular needs of the application.

*This work was supported in part by grants from Sprint
Corporation.

Current benchmark suites and methods tend to con- 
centrate on a specific part of the endsystem. Operating 
system benchmarks concentrate on component operat- 
ing system operations, but may say comparatively little 
about how well the operating system will support ORB 
middleware. ORB benchmarks concentrate on compo- 
nent operations of the middleware, but are less effec- 
tive at pinpointing problems at the application, operat- 
ing system, and network layers. Developers consider- 
ing non-trivial ORB based applications need the abil- 
ity to evaluate, in some detail, how well a given ORB 
and endsystem combination can support candidate ap- 
plication object architectures. They need this informa- 
tion before implementing a significant portion of the en- 
tire application. Such developers should begin with a 
set of standard performance benchmark suites exercis- 
ing various aspects of the ORB endsystem under realis- 
tic application scenarios, but they also require the ability 
to create test scenarios which specifically model their 
candidate application architectures and behavior in the 
endsystem context. 

Current benchmarking methods and test suites do not 
adequately solve the real problem developers face
because current methods concentrate on only a part of the
application and endsystem in isolation and thus do not 
enable the developer to consider how implementation 
decisions at various levels interact. An effective and 
efficient tool set supporting an integrated performance 
evaluation methodology should support ORB, endsys- 
tem, and application oriented tests, should be automated, 
and should make it easy for the user to extend and mod- 
ify the set of tests performed. Only such an integrated 
tool set and benchmark test suite supporting realistic ap- 
plication scenarios and capable of collecting information 
from all layers of the endsystem can enable developers 
to effectively evaluate candidate application object ar- 
chitectures before implementation. 

This paper describes how a combination of tools de- 
veloped at the University of Kansas (KU) can address 
this challenge. This integrated tool set represents a 
significant advance in support for performance evalua- 
tion of ORB based applications because it increases the 
range and complexity of tests that a benchmark suite 
can contain, it extends the types of performance in- 
formation which can be gathered during an individual 
test, and its support for automated test execution sig- 
nificantly extends the number of tests that a practical 
benchmark suite can contain. The NetSpec tool pro- 
vides a control framework for script driven automation 
of distributed performance tests. The Data Stream Ker- 
nel Interface (DSKI) provides the ability to gather time 
stamped events and a variety of other performance data 
from the operating system as part of a NetSpec exper- 
iment. The Performance Measurement Object (PMO) 
provides the ability to conduct NetSpec based experi- 
ments involving CORBA objects, and the Performance 
Pattern Language (PPL) provides a higher level
language for describing NetSpec based experiments
involving sets of CORBA objects more succinctly.

NetSpec has been used by a number of research 
projects at the University of Kansas (KU) and elsewhere. 
It provides the automation and script based framework 
supporting experiments including a wide range of condi- 
tions, component behaviors, and data collection [12, 16]. 
NetSpec is designed to be extended and modified by the 
user through the implementation of daemons. Test dae- 
mons support basic network performance tests and sup- 
ply background traffic in other NetSpec based experi- 
ments. Measurement daemons gather information dur- 
ing an experiment but contribute no traffic or behavior 
beyond that required to gather data. 

The DSKI is a pseudo-device driver which enables
a NetSpec experiment, through the DSKI measurement
daemon, to specify and collect the set of operating
system level events of interest which occur during the
experiment [1]. The PMO is a NetSpec test daemon
designed to support CORBA based performance
experiments. A NetSpec PMO script can specify the creation
of CORBA objects, their execution time behavior, and
the relations that hold among the objects. Using
existing traffic related NetSpec test daemons, the DSKI,
and the PMO, a user can write a script that specifies a
set of interacting objects and a set of network
background traffic providing a context within which the
objects exist, and that gathers operating system level
information about network and operating system level
events affecting performance.


A practical drawback of the NetSpec PMO support is 
that the language is defined at a low level of detail, and 
PMO scripts for scenarios with many objects are thus 
long and repetitive. The PPL addresses this by defin- 
ing a higher level language for more compactly describ- 
ing application level object interaction scenarios, which 
abstract the performance aspects of commonly used im- 
plementation strategies. We have called these scenarios 
performance patterns to draw a direct analogy to design 
patterns which the definitive book Design Patterns de- 
fines on page 3 as “descriptions of communicating ob- 
jects and classes that are customized to solve a general 
design problem in a particular context” [6]. 


A performance pattern is a set of objects exhibiting 
a set of behaviors, relationships, and interactions typical 
of an application architecture or class of application ar- 
chitectures. This pattern can be customized through pa- 
rameter specification or user extension to match the in- 
tended application behaviors and architecture as closely 
as required. The PPL compiler emits NetSpec PMO 
scripts implementing the specified performance pattern. 


It is important to realize that the PPL approach is
quite general and is not ORB or even CORBA specific.
The PPL could easily be used to create object based
performance scenarios given support from a NetSpec
daemon of the correct type. The PMO is CORBA specific,
but it would be straightforward to implement an
analogous NetSpec daemon for DCE or DCOM based
performance evaluation. The PMO is not ORB specific
and has been ported with minimal effort to four ORBs:
The ACE ORB (TAO)[20], OmniORB[23], ExperSoft's
CORBAplus[5], and ILU[11]. We currently focus on
TAO, OmniORB, and CORBAplus for project specific
reasons. The range of experiments which can be
supported is a function, in part, of the set of possible
object behaviors supported by the PMO. PMO behaviors
are implemented by routines linked into the PMO, and
it has been designed to make adding new behaviors
simple, thus supporting user extension.


The rest of the paper first discusses related work in 
Section 2, and then describes the implementation of the 
PMO and PPL in Section 3. Section 4 presents exam- 
ples of PMO and PPL use, while Section 5 presents our 
conclusions and discusses future work. 


2 Related Work


A number of efforts have been made to measure the 
performance of ORBs, often comparing with perfor- 
mance of other ORBs [23]. Earlier studies on the per- 
formance of CORBA objects focussed mainly on identi- 
fying the performance constraints of an Object Request 
Broker (ORB) alone. Schmidt analyzed the performance 
of Orbix and VisiBroker over high speed ATM networks 
and pointed out key sources of overhead in middleware 
ORBs [7, 8]. This paper complements Schmidt’s work 
by demonstrating an integrated and automated approach 
which is capable of simultaneously measuring the influ- 
ence of the ORB, the operating system and the underly- 
ing network on the performance of CORBA objects. 

Studies have also been conducted on IDL compiler 
optimizations that can improve overall performance of 
the ORB. One such effort is the Flick project[4]. It 
is claimed that Flick-generated stubs marshal data be- 
tween 2 and 17 times faster than stubs produced by tra- 
ditional IDL compilers, resulting in an increased end-to- 
end throughput by factors between 1.2 and 3.7. While 
clearly addressing an important topic, the Flick work 
also clearly concentrates on one specific facet of endsys- 
tem performance. Our work complements such efforts 
by providing a platform within which the effect of such 
efforts on endsystem performance can be evaluated. 

TAO is the ACE ORB being developed at Washing- 
ton University [17]. This project focuses on: (1) identi- 
fying the enhancements required to standard ORB spec- 
ifications that will enable applications to specify their 
Quality of Service (QoS) requirements to ORBs, (2) 
determining features required to build real-time ORBs, 
(3) integrating the strategies for I/O subsystem architectures
and optimizations with ORB middleware, and (4)
capturing and documenting key design patterns necessary
to develop, maintain and extend real-time ORB middleware.
The work described in this paper complements
these goals by providing a way to capture, document,
and evaluate performance aspects of ORB based design
patterns.

In addition to providing a real-time ORB, TAO is 
an integrated ORB endsystem architecture that consists 
of a high-performance I/O subsystem and an ATM port 
inter-connect controller (APIC). The TAO developers have
built a wide range of performance tests, which include
throughput tests[7], latency tests[8] and demultiplexing tests[8].
They have used these performance tests to test TAO
[17] and other CORBA 2.0 compliant ORBs. Their tests
formed a basis for several of the basic tests in the auto- 
mated framework described in this paper. 

A commercially available CORBA test suite is the 
VSORB from X/Open [28]. VSORB is implemented 
under the TETware test harness, a version of the Test
Environment Toolkit (TET), a widely used framework
for implementing test suites [24]. It is designed for 
two primary uses: (1) testing ORB implementations for 
CORBA conformance and interoperability under formal 
processes and procedures, and (2) CORBA compliance 
testing by ORB implementors during product develop- 
ment and quality assurance. This work differs from ours 
in that it concentrates on compliance rather than perfor- 
mance, but clearly shares the goal of creating a general 
framework for large scale evaluation tests. 

The Manufacturing Engineering Laboratory at the 
National Institute of Standards & Technology (NIST)
takes a different approach towards the benchmarking
of CORBA in their current work on the Manufacturer’s
CORBA Interface Testing Toolkit (MCITT) [13]. They
use an emulator-based approach in which the actual
servers are replaced by test servers and the person do- 
ing the testing only needs to specify the behaviors that 
are important for the specific scenario being examined. 
The approach provides an extremely simplified procedu- 
ral language, the Interface Testing Language, for spec- 
ifying and testing the behavior of CORBA clients and 
servers. This work is similar to ours with respect to its 
abstraction of the object behavior, but it does not explic- 
itly integrate endsystem evaluation, concentrating only 
on the application and ORB middleware. 

The Distributed Systems Research Group at Charles 
University, Czech Republic, have done a comparison of 
three ORBs based on a set of criteria including dispatch- 
ing ability of the ORB, throughput provided for the in- 
vocation of different data types, scalability, and perfor- 
mance implications of different threading architectures 
[2, 19]. The criteria address different aspects of the
ORB functionality and the influence of each criterion 
has been discussed with respect to specific ORB usage 
scenarios. They have also developed a suite of bench- 
marking applications for measurement and analysis of 
ORB middleware performance. This is a strong effort, 
but the drawback to this approach, in our view, is that 
it is restricted to evaluating ORB level performance and 
specific predefined application scenarios. This is signif- 
icant because application behaviors will vary and their 
method does not appear, by our understanding, to be de- 
signed to support user specified test scenarios. 

Performance evaluation is an important topic in many 
areas of computer system design and implementation, 
and significant related work exists which does not con- 
sider ORB performance. Data bases provide some of the 
best developed examples of benchmarks addressing ap- 
plication scenarios. It is interesting to observe that both 
data bases and ORBs support applications by assuming 
the role of middleware. As such, performance evaluation 
of data bases is most meaningful and useful to potential 
users when it considers application scenarios. 

The Wisconsin Benchmark is an early effort to systematically
measure and compare the performance of
relational database systems with database machines[3]. 
The benchmark is a single-user and single-factor exper- 
iment using a synthetic database and a controlled work- 
load. It measures the query optimization performance 
of database systems with 32 query types to exercise the 
components of the proposed systems. This is similar to 
our effort in that it abstracts the application scenario and 
considers a range of system functions. Our work differs, 
however, in that we also provide for placing the set of 
ORB based objects in an endsystem context including 
background load and traffic. 

The ANSI SQL Standard Scalable and Portable 
Benchmark (AS3AP) models complex and mixed work- 
loads, including single-user and multi-user tests, as well 
as operational and functional tests [25]. There are 39 
single-user queries consisting of utilities, selection, join, 
projection, aggregate, integrity, and bulk updates. The 
four multi-user modules include a concurrent random 
read test and a pure information retrieval (IR) test[26]. 
The concurrent random write test is used to evaluate the 
number of concurrent users the system can handle updat- 
ing the same relation. The mixed IR test and the mixed 
OLTP test are to measure the effects of the cross-section 
queries on the system with concurrent random reads or 
concurrent random writes. This effort has a stronger 
similarity to ours in that it considers a wider range of 
activity as well as multiple users. It does not, to our 
knowledge, provide support for users to specify applica- 
tion based test scenarios. 


3 Implementation 


We have implemented an integrated tool-based approach
for measuring ORB endsystem performance.
The single most important aspect of
our system is that it measures performance within the 
target environment, rather than relying on published data 
that may be inaccurate, or which accurately describes as- 
pects of performance under a different environment. The 
main features of this approach are: 


1. A script based approach for conducting perfor- 
mance tests which promises better expressiveness 
of experiments. 


2. The ability to study the performance of CORBA 
objects in the context of different operating system 
loads and network traffic. 


3. The ability to study the influence of different com- 
ponents of the CORBA endsystem including the 
middleware, the operating system, and the network 
on the performance of CORBA objects. 


4. The ability to measure the performance of objects
in heterogeneous distributed systems from a single 
point of control. 


5. The flexibility and scalability to specify a wide 
range of distributed tests and behavior patterns. 
This includes scalability in time, number of objects, 
and number of hosts supporting the pattern. 


6. The ability to measure latencies, throughput and 
missed deadlines among a wide range of perfor- 
mance metrics. 


7. An automated, highly scalable framework for performance
measurement. This is a crucial feature
because it enables practical use of much
larger benchmarking suites than non-automated
approaches.


The performance metrics which best predict appli- 
cation performance depend, in part, on the properties 
of the application. This is one reason why a pattern 
based and automated framework is required. The pat- 
tern orientation enables the user to describe scenarios 
with a rich and varied set of behaviors and requirements, 
closely matching the proposed application architecture. 
Automation enables testing on a large scale, letting
the user test a wide range of parameters under a wide
range of conditions. This avoids many potentially
unjustified assumptions about which aspects of the
application, ORB, and endsystem are
important in determining performance.

The metrics which will be crucial for important 
classes of applications include: throughput, latency, 
scalability, reliability and memory use. The system 
parameters which can affect application performance 
with respect to these metrics include: multi-threading, 
marshalling and demarshalling overhead, demultiplex- 
ing and dispatching overhead, operating system schedul- 
ing, integration of I/O and scheduling, and network la- 
tency. Our approach currently enables us to examine 
the influence of many of these aspects of the system on 
performance, and further development will enable us to 
handle all of them. 

Figure 1 shows our integrated benchmarking framework
supporting performance evaluation tests. The
experiment description expressed in the PPL script is 
parsed by the PPL compiler which emits a PMO Net- 
Spec script implementing the specified experiment. The 
NetSpec parser processes the PMO based script and 
instructs the NetSpec controller daemon to create the 
specified sets of daemons on each host used by the 
distributed experiment. Note that Figure 1 illustrates 
a generic set of daemons, rather than those support- 
ing a specific test. The PMO daemon interfaces the 
CORBA based objects on that host to the NetSpec controlling
daemon. An additional PMO object is sometimes
used, and communicates with the PMO daemon,
because CORBA objects can be created dynamically. 
Note that the line between the PMO objects represents 
their CORBA based interaction, which is the focus of 
the experiment. The DSKI measurement daemon, if 
present, is used to gather performance data from the 
operating system. It is a generic daemon and is not 
CORBA based. The traffic daemon is also not CORBA 
based, but is used to create a context of system load and 
background traffic within which the CORBA objects ex- 
ist. 

Our approach integrates several existing tools and 
adds significant new abilities specifically to support 
CORBA. The tools integrated under this framework 
are NetSpec[12, 16], the Data Stream Kernel Inter- 
face (DSKI)[1], the Performance Measurement Object 
(PMO)[10, 9], and the Performance Pattern Language 
(PPL). The rest of this section discusses each component
in greater detail.


3.1 NetSpec 


NetSpec has been used by a number of research 
projects at the University of Kansas (KU) and elsewhere. 
It provides the automation and script based framework 
supporting experiments including a wide range of condi- 
tions, component behaviors, and data collection [12, 16]. 
NetSpec is designed to be extended and modified by the 
user through the implementation of daemons supporting 
specific component roles in experiments. Test daemons 
are used as active components, traffic sources and sinks, 
while measurement daemons are passive with respect to 
the experiment since they only collect measurements. 
Existing NetSpec daemons support network level per- 
formance tests with many simultaneous connections and 
traffic load profiles, as well as data collection from both 
hosts and network nodes using measurement daemons. 
A wide range of NetSpec daemons exist, providing a 
range of behaviors and functions, including: TCP/UDP 
traffic load, ATM signaling load, SNMP data collection, 
and DSKI data collection from the operating system. 


3.2 Data Stream Kernel Interface 


The DSKI is a pseudo-device driver which enables 
a NetSpec experiment, through the DSKI measurement 
daemon, to specify and collect a series of time-stamped 
operating system level events of interest which occur 
during the experiment [1]. This is particularly useful 
when considering interactions among the application, 
middleware, and operating system levels of the endsystem.
The primary target platform for the DSKI is Linux,
but we have also ported it to DEC UNIX, and as a
pseudo-device driver it can be ported relatively easily
to any version of UNIX. The DSKI supports a range
of data collection options differing in level of detail
and overhead. One particularly powerful feature is
the ability to associate an arbitrary tag with an event.
For example, when the tag is a packet ID or buffer ad- 
dress this enables post processing to track the progress 
of specific messages through the protocol stack. In the 
CORBA context, post processing of the event stream 
shows the amount of time spent by messages in differ- 
ent portions of the operating system when making ob- 
ject request calls to and from the ORB. We are currently 
working to create a similar ability to create, configure, 
and process streams of events from the CORBA and ap- 
plication levels. 
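
The tag-based post processing described above can be sketched in a few lines of Python. This is a minimal illustration, not the DSKI's actual interface: the event record layout (timestamp, event name, tag), the event names, and the sample values are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical event records: (timestamp in microseconds, event name, tag).
# The layout and the event names are assumptions for illustration only;
# they are not the DSKI's actual record format.
EVENTS = [
    (100, "socket_send", "pkt-1"),
    (130, "ip_output", "pkt-1"),
    (135, "socket_send", "pkt-2"),
    (190, "driver_xmit", "pkt-1"),
    (170, "ip_output", "pkt-2"),
    (260, "driver_xmit", "pkt-2"),
]


def time_per_stage(events):
    """Group events by tag and return, for each tag, the time elapsed
    between consecutive events (the time spent in each stage)."""
    by_tag = defaultdict(list)
    for timestamp, name, tag in events:
        by_tag[tag].append((timestamp, name))

    result = {}
    for tag, tagged in by_tag.items():
        tagged.sort()  # order each tag's events by timestamp
        result[tag] = [
            (f"{a_name}->{b_name}", b_time - a_time)
            for (a_time, a_name), (b_time, b_name) in zip(tagged, tagged[1:])
        ]
    return result


stages = time_per_stage(EVENTS)
```

Because the tag travels with the message, events from interleaved messages can be separated after the fact without any coordination at collection time.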


3.3 Performance Measurement Object


The PMO is a NetSpec test daemon designed to 
support CORBA based performance experiments. The 
PMO enables a NetSpec script to specify the creation of 
CORBA objects, their execution time behavior, and the 
relations that hold among the objects. The PMO control 
layer parses the instructions from the NetSpec controller 
specifying its role within an experiment. These objects 
can exhibit a variety of behaviors, and are capable of ex- 
changing a wide range of CORBA data with each other. 

The PMO provides all the basic abilities required to
conduct CORBA based evaluation experiments, but experience
has shown that it is not always the most convenient
way to describe them. The reason is that NetSpec’s method
of ensuring user extensibility and portability also makes
NetSpec scripts very long. The best analogy is to
consider the NetSpec scripting language an architecture
independent assembly language: it is possible
to describe any desired experiment, but doing so is
sometimes tedious. The PMO level is appropriate for
describing basic CORBA component tests, but can be unwieldy
when used for application level object interaction scenarios.
The PPL addresses this problem, and is discussed in
Section 3.4.

The PMO NetSpec script language describes an ex- 
periment in terms of sets of daemons. Each daemon 
specification provides a complete list of parameter-value 
pairs describing that daemon’s role in the experiment.
Groups of daemons are created and executed by the Net- 
Spec controller either in serial or in parallel. These 
simple constructs make it possible to describe a wide 
range of sophisticated application level behaviors. Ad- 
ditional constructs make it possible to have sets of dis- 
tributed subordinate controller daemons for large scale 
distributed experiments. The details of the NetSpec syntax
are described elsewhere [12, 16], but the examples
described in Section 4 should provide a clear idea of how
the system works.

Figure 1. The Integrated Benchmarking Framework


3.4 Performance Pattern Language 


The PPL was designed as a higher level language 
for describing such application level object interac- 
tion scenarios[14, 6] in terms of performance patterns. 
Within each pattern, the user describes objects, object 
behaviors, test types and relations among the objects 
that influence the performance of the pattern as a whole. 
For convenience, the PPL also permits the user to de- 
fine parameter blocks describing aspects of object be- 
havior which are referenced by object definitions using 
the same set of parameters. After the patterns associ- 
ated with the experiment are specified, the schedule for 
their execution is given. Currently the only schedule 
supported is a simple sequential execution of one pattern 
at a time. However, we are working on extending this to 
permit flexible pattern composition and dynamic time 
dependent behavior to better support application sce- 
nario based testing. The correspondence between PPL 
constructs and the PMO level scripts produced by the 
PPL compiler is illustrated by the examples in Section 
4. 
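
The way an object definition draws on a parameter block can be sketched in Python. The names below are taken from the PPL examples in Section 4, but the merge rule (settings given directly in the object definition take precedence over the referenced block) is an assumption about the intended semantics, not a description of the PPL compiler.

```python
def resolve_object(obj, param_blocks):
    """Return the full parameter set for one object definition.

    Assumed rule: start from the referenced parameter block, then let
    settings given directly in the object definition override it.
    """
    resolved = dict(param_blocks[obj["param"]])
    resolved.update({k: v for k, v in obj.items() if k != "param"})
    return resolved


param_blocks = {
    "client-param": {"orb_name": "omniORB2", "test_type": "cubit",
                     "predelay": 3, "postdelay": 3},
}
# An object that references the block but sets its own predelay.
proxy_c2 = {"behaviour": "client", "param": "client-param", "predelay": 5}

resolved = resolve_object(proxy_c2, param_blocks)
```

Under this rule a single block can serve several objects, each of which states only the values that differ.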

The combination of the PMO and PPL provides a 
powerful and efficient way for developers to describe 
and conduct a wide range of application scenario based 
performance evaluation experiments for CORBA systems.
The method is applicable to any ORB and has
been ported with minimal effort to four ORBs: The
ACE ORB (TAO)[20], OmniORB2[23], ExperSoft’s
CORBAplus[5], and ILU[11]. The range of experiments
which can be supported is a function, in part, of the set of 
possible object behaviors supported by the PMO. PMO 
behaviors are implemented by routines linked into the 
PMO, and it has been designed to make adding new be- 
haviors simple. 

The scalability of our method is important in two 
ways. First, script driven automation of the experiments 
makes it fairly easy to describe tests at a scale represen- 
tative of the final application. Second, the script driven 
automation makes it possible to conduct an acceptably 
large and comprehensive set of tests in an acceptably 
short period of time. For example, sets of tests produc- 
ing graphs discussed in Section 4 are fully automated 
and execute in periods ranging from a few seconds to al- 
most an hour. Scalability is important because the num- 
ber of properties of an ORB which can significantly af- 
fect performance of a particular application is large, re- 
quiring a large test suite for adequate evaluation. 


4 Evaluation 


This section illustrates current capabilities as well as 
the potential of our automated script driven and appli- 
cation scenario based performance evaluation methods 
and tools. The examples show how the tests used in current
benchmarks are supported by the framework, and
how these can be used as components of more sophisti- 
cated scenario based performance patterns. This section 
presents results of two types of tests under two patterns 
to illustrate our methods. Section 4.1 presents results 
under the simple client-server pattern and behaviors for 
the cubit and throughput test types. Section 4.2 presents 
the results under the proxy pattern for the same behaviors
and test types. Section 4.3 demonstrates the use of
the DSKI to reveal the components of the system sup- 
port overhead for the client-server pattern using a sim- 
ple request-response behavior. We also demonstrate the 
portability of our method by presenting results for both 
Linux and Solaris. Table 1 presents the Linux testing 
environment for the cubit and throughput behavior tests, 
while Table 2 presents that for Solaris. Note that the 
sending machine is slightly slower than the receiving 
machine. We originally used identical machines, but a 
machine failure forced us to use a different receiving 
machine for tests presented here. 












Name of ORB
Language Mapping
Operating System      Redhat Linux 5.1, kernel 2.1.126
CPU info              Pentium Pro 200 MHz, 128 MB RAM
Compiler info         egcs-2.90.27 (egcs-1.0.2 release),
                      no optimizations
Thread package
Type of invocation
Measurement method
Network Info.

Table 1. Operating Environment Used for
the Tests on Linux Platform





Significant further development of our approach is 
desirable, and is proceeding, but the current capabilities 
of the tools generally meet and modestly exceed some 
aspects of current practice. It is important to note that the 
framework is explicitly designed for user extension pre- 
cisely because no single developer or authority can know 
every significant aspect of ORB evaluation. Accumula- 
tion of the sum of the CORBA community’s collective 
wisdom concerning ORB evaluation would significantly 
advance the state of the art. The script based automated 
approach described here is designed to support such a 
collective effort. 











Name of ORB           omniORB2, TAO, CORBAplus
CPU info              Ultra Sparc-II 296 MHz (S)
                      Ultra Sparc-IIi 350 MHz (R)
                      128 MB RAM
Compiler info         SUN C++ 4.2
Network Info.         ATM

Table 2. Operating Environment Used for
the Tests on Solaris Platform


4.1 Simple Client-Server Pattern 


This example illustrates the basic elements of the 
PPL and PMO in the context of a simple client-server 
pattern, which reflects current conventional benchmark- 
ing practice. Listed below is the PPL script correspond- 
ing to the scenario of Figure 2. The client and server in 
this case are Sender and Receiver respectively. The in- 
formation regarding the parameters required for the test- 
ing between these two CORBA objects is provided in 
the object blocks of the PPL script and the kind of rela- 
tion between the objects is specified in the relation block 
of the PPL script. Execution of the pattern is specified 
by the one line schedule. 

The PPL compiler takes the script as input, analyzes 
the object definitions and relations, and generates the 
NetSpec PMO script shown in Figure 3. The first thing
to note is that the PMO script has two major sections,
one defining the client as a corba daemon running on
the machine marcus, and the other defining the server as
a corba daemon running on the machine zeno. The second
is that the parameter block is specified explicitly for each
daemon. The main point is that the PMO script defines
each object separately, so the relations among them
are more difficult to discern in the PMO language.
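
One part of this translation, the spreading of a single relation over two separate daemon definitions, can be illustrated with a short Python sketch. It turns each (client, server) pair into per-daemon relations entries written in the style of Figure 3. The function name and data layout are hypothetical, and the real PPL compiler may work quite differently.

```python
def expand_relations(relations):
    """Map each object name to the list of `relations` entries its daemon
    would carry, in the style of Figure 3 (e.g. "server{Receiver}")."""
    per_object = {}
    for client, server in relations:
        # The client's daemon names its server, and vice versa.
        per_object.setdefault(client, []).append(f"server{{{server}}}")
        per_object.setdefault(server, []).append(f"client{{{client}}}")
    return per_object


entries = expand_relations([("Sender", "Receiver")])
```

The sketch shows why the relation is harder to see at the PMO level: one pair in the PPL becomes two entries held in different daemon blocks.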

Figure 4 shows the performance of the Client-Server 
pattern supporting the cubit test type for OmniORB and 
TAO on a Linux platform, while Figure 5 shows the re- 
sults for OmniORB, TAO and CORBAplus on a Solaris 
platform. The CORBAplus ORB is not currently available
for Linux, but should be soon. The flexibility of the
script driven approach is demonstrated by the observa- 
tion that the TAO based tests were repeated for the Om- 
niORB by replacing orb_name = TAO with orb_name = 
OmniORB in the PPL script. The cubit test emphasizes 
basic communication performance because it involves 
small packets, and a simple computation (cube a number)
on the server side. The cubit behavior thus focuses
on the time spent by each packet in the system layers
and the middleware for a CORBA call invocation. The
results shown are the average values for 250 invocations
of the basic operation for each of several CORBA data
types, and are presented in terms of calls per second.


pattern CUBIT-TESTS {
    param_block param1 {
        test_type = cubit; orb_name = TAO;
        minsize = 512; maxsize = 8192;
        predelay = 5; postdelay = 5;
        duration = 10; multiples = 2;
        protocol = iiop; qos = normal;
        criteria = latency;
    }

    object Sender {
        machine_name = marcus; interface = eth;
        behavior = client; param = param1;
        numsamples = 250;
    }

    object Receiver {
        machine_name = zeno; interface = eth;
        behavior = server; param = param1;
        numsamples = 250; port_num = 22222;
    }

    relations {
        (Sender, Receiver);
    }
}

/* Execution Schedule */
CUBIT-TESTS;


Figure 2. Simple Client-Server Pattern


cluster {
    corba marcus {
        NameOfORB = TAO;
        TypeOftest = cubit;
        TestParams = (
            numsamples = 250, minsize = 512,
            maxsize = 8192, multiples = 2,
            predelay = 5, postdelay = 5,
            duration = 10 );
        protocol = iiop;
        objname = Sender;
        role = client;
        relations = server{Receiver};
        criteria = latency;
        qos = normal;
        own = marcus (interface = eth);
    }

    corba zeno {
        NameOfORB = TAO;
        TypeOftest = cubit;
        TestParams = (
            numsamples = 250, minsize = 512,
            maxsize = 8192, multiples = 2,
            predelay = 5, postdelay = 5,
            duration = 10 );
        protocol = iiop;
        objname = Receiver;
        role = server;
        relations = client{Sender};
        criteria = latency;
        qos = normal;
        own = zeno (interface = eth, port = 22222);
    }
}


Figure 3. Corresponding Client-Server
PMO NetSpec Script


Figure 4. Client-Server Cubit for OmniORB
and TAO on Linux
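
The metric used for the cubit results, average calls per second over a series of invocations, can be made concrete with a small Python sketch. It assumes that the latency of each invocation is recorded in seconds and that invocations are issued back to back. The sample values are invented for illustration and are not measured data.

```python
def calls_per_second(latencies_s):
    """Average call rate for a series of back-to-back invocations,
    given each invocation's latency in seconds."""
    if not latencies_s:
        raise ValueError("at least one sample is required")
    return len(latencies_s) / sum(latencies_s)


# Illustrative values only: 250 invocations of 2 ms each.
samples = [0.002] * 250
rate = calls_per_second(samples)
```

Dividing the number of calls by the total elapsed time weights slow invocations properly, whereas averaging the per-call rates would overstate the result when latencies vary.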

There are several points of interest in these results.
The first is that even such a simple test reveals differences
between ORB implementations, and between
operating system platforms. The most striking difference
is that while TAO performance is essentially the same
on Linux and Solaris, OmniORB performance
on Solaris is roughly double that on Linux for many,
but not all, data types. Another observation is that OmniORB
generally outperforms the other ORBs, although its
performance for the “llse” (long long sequence) data type is
substantially below that of TAO on Linux.


Determining why these observed behaviors occur 
will take further study, but this demonstrates the im- 
portant point that our compact PPL script describes a 
test which can be run automatically in a matter of sec- 
onds, revealing significant differences in ORB behav- 
ior, and providing a convenient and efficient founda- 
tion for further experimentation. The flexibility of the 
PPL approach is further illustrated by changing the test 
type from cubit to throughput in the client-server pattern,
producing the Linux throughput results illustrated
in Figure 6 for TAO and Figure 7 for OmniORB. A
data file, essentially of the CORBA “char” data type, ranging
from 1 MB to 64 MB is sent using buffer sizes ranging 
from 512 bytes to 16 KB. This test shows that through- 
put for both ORBs is constant with the total amount of 
data, but that throughput is significantly affected by the
buffer size used for each data transfer session. The TAO
throughput increased with buffer size, indicating that the
packet transfer rate was limited, but not the packet size.
The OmniORB throughput varied in a much less obvious
way, and was significantly greater for 4 KB buffers.
Determining why OmniORB performance varies so haphazardly
with buffer size would require gathering data
from the operating system layer, as discussed in Section
4.3. The throughput tests for a single client-server pair
ran under NetSpec control in an elapsed time of approximately
15 minutes.

Figure 5. Client-Server Cubit for OmniORB,
TAO and CORBAplus on Solaris
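
The throughput experiment amounts to a sweep over two parameters, data size and buffer size. The Python sketch below enumerates such a grid, assuming that both values grow by a constant factor of 2 between steps. The text gives only the end points of each range, so the step rule and the resulting number of runs are assumptions.

```python
def geometric_range(start, stop, factor=2):
    """Values start, start*factor, ... up to and including stop."""
    values = []
    value = start
    while value <= stop:
        values.append(value)
        value *= factor
    return values


KB, MB = 1024, 1024 * 1024

buffer_sizes = geometric_range(512, 16 * KB)   # 512 bytes .. 16 KB
data_sizes = geometric_range(1 * MB, 64 * MB)  # 1 MB .. 64 MB

# One throughput run per (data size, buffer size) combination.
runs = [(d, b) for d in data_sizes for b in buffer_sizes]
```

Even this small grid produces dozens of runs, which is the kind of workload that script driven automation is meant to make practical.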


4.2 Proxy Pattern 


This section discusses a more complex CORBA application
scenario, the Proxy Pattern [22], in which the
proxy object acts as an interface between the CORBA
clients and CORBA servers. Figure 8 shows this pattern
for three client and three server objects, together with
the PPL script implementing it for the cubit test type
under OmniORB. The proxy pattern uses the basic client-server
pattern as a component, extending it to a group of
client-server pairs communicating through a proxy object. In
this case we use three client-server pairs under the proxy
pattern, which exhibit the client and server behaviors,
respectively, while executing the cubit and throughput test
types. The client contacts the corresponding server at
run-time by passing either the object reference or the
server’s name registered with the CORBA Naming Service
to the proxy object, which forwards the client
request to the appropriate server. The data type used for
the transfer of information between the clients and the
proxy object is CORBA “Any”.








Figure 6. Client-Server Throughput for TAO
on Linux (horizontal axis: Data Size)


Figure 7. Client-Server Throughput for OmniORB
on Linux (horizontal axis: Data Size)
pattern Proxy {
    param_block client-param {
        orb_name = omniORB2; test_type = cubit;
        numsamples = 250; minsize = 1;
        maxsize = 1; multiples = 1;
        predelay = 3; postdelay = 3;
        protocol = iiop; qos = normal;
        criteria = latency; interface = eth;
        machine_name = marcus;
    }
    param_block server-param {
        orb_name = omniORB2; test_type = cubit;
        predelay = 3; postdelay = 4;
        interface = eth; machine_name = zeno;
    }

    object proxy-C1 {
        behaviour = client; param = client-param;
    }
    object proxy-C2 {
        behaviour = client; param = client-param;
        predelay = 5; postdelay = 3;
    }
    object proxy-C3 {
        behaviour = client; param = client-param;
        predelay = 7; postdelay = 3;
    }
    object proxy-S1 {
        behaviour = server; param = server-param;
        port_num = 10000;
    }
    object proxy-S2 {
        behaviour = server; param = server-param;
        port_num = 20000;
    }
    object proxy-S3 {
        behaviour = server; param = server-param;
        port_num = 30000;
    }
    object proxy-router {
        behaviour = proxy; param = server-param;
        port_num = 30003;
    }

    relations {
        (proxy-C1,proxy-router); (proxy-C2,proxy-router);
        (proxy-C3,proxy-router); (proxy-router,proxy-S1);
        (proxy-router,proxy-S2); (proxy-router,proxy-S3);
    }
}

/* Execution Schedule */
Proxy;


Figure 8. Proxy Pattern and PPL Script


Figure 9. Proxy Cubit Results for OmniORB
and TAO on Linux


Figure 10. Proxy Cubit Results for OmniORB,
TAO and CORBAplus on Solaris (horizontal axis:
Test Types for the Cubit Proxy Test Using an Any
on Solaris platform)


Figure 11. Proxy Throughput for OmniORB
on Solaris

The proxy object can be used in two modes. In the
first, the proxy only plays a role when establishing a 
connection between the client and server. In the other 
mode, the proxy actually routes the data between the 
objects, with a significant effect on performance. We 
present results for the second mode. 
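
The two modes can be contrasted with a plain Python sketch that uses no CORBA at all. The class and method names are hypothetical. The point is only that in the first mode the proxy is consulted once, while in the second every request passes through it.

```python
class Server:
    def cube(self, x):
        return x ** 3


class Proxy:
    """Hypothetical stand-in for the proxy object."""

    def __init__(self, servers):
        self.servers = servers   # name -> server object
        self.forwarded = 0       # requests routed through the proxy

    def lookup(self, name):
        # Mode 1: only help the client find its server.
        return self.servers[name]

    def cube(self, name, x):
        # Mode 2: route every request (and reply) through the proxy.
        self.forwarded += 1
        return self.servers[name].cube(x)


proxy = Proxy({"proxy-S1": Server()})

# Mode 1: one lookup, then direct calls that bypass the proxy.
direct = proxy.lookup("proxy-S1")
mode1 = [direct.cube(n) for n in range(3)]

# Mode 2: each call costs an extra hop through the proxy.
mode2 = [proxy.cube("proxy-S1", n) for n in range(3)]
```

Both modes return the same answers. They differ in how many messages cross the proxy, which is why the second mode is the one expected to affect performance.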

Figure 9 shows the performance of the omniORB and 
TAO client objects using the cubit test type under the 
proxy pattern on Linux, while Figure 10 shows the performance
of clients under those ORBs as well as under
CORBAplus on Solaris. The calls per second
shown in Figure 9 are averaged over
the three clients for both OmniORB and TAO. There
was some non-trivial variance among clients for some 
tests and some ORBs, which would be another interest- 
ing point for further investigation. However, we illus- 
trate the use and utility of our methods using the average 
results, within which there are several points of interest. 

The most obvious point is that using the proxy object
to mediate data transfer between client and server
significantly impacts performance, reducing it to approximately
10 percent of that for the simple client-server pattern.
Some impact is certainly expected due to the use of three 
concurrent client-server pairs, and reduction to 30 per- 
cent of the single pair performance would be plausible. 
Clearly, using a proxy object has a significant additional
impact on performance. While not particularly surprising,
this result emphasizes the importance of
application-scenario-based testing. This pattern was, for example,
discussed in a popular magazine [22] and is used by one 
of our colleagues as the basis for a WWW meta-search 
engine. Clearly, any developer contemplating such an 
architecture would be grateful to know the likely impact 
before implementing the software. 

The second point of interest is that both TAO and 
OmniORB enjoy a significant performance increase in 
moving from Linux to Solaris, while the TAO perfor- 
mance for the client-server pattern was relatively con- 
stant between the two systems. A third significant ob- 
servation is that the magnitude of the performance in- 
crease for OmniORB in moving from Linux to Solaris is 
much greater, increasing three- to five-fold in most cases.
Finally, the difference between TAO and CORBAplus 
performance under Solaris is greater under this pattern 
than under the client-server pattern. 

These observations support our assertion that application
performance scenarios, which we call performance patterns,
should be part of any comprehensive benchmark. The
comparative performance between different ORBs on
the same operating system and between the same ORB
on different operating systems changed significantly
with the change in pattern. This also supports our idea
that developers using performance results to select an
ORB and operating system as an implementation platform
should use test results for object architectures (performance
patterns) that faithfully represent their proposed
application.


Figure 12. Proxy Throughput for TAO on Solaris
(y-axis: throughput in Mbps; x-axis: data size)


We also changed the test type, as we did for the
client-server pattern, to test the throughput performance
among the object pairs. Figure 11 shows results for
OmniORB under Solaris, while Figure 12 presents the
throughput results for TAO. Both tests show that throughput
is reduced by a factor of five to ten. The TAO results still
show an orderly increase in throughput with buffer size,
although this converges to a level of 2 Mb/s for all but
the smallest buffer size. OmniORB performance, in contrast,
does not vary in nearly as orderly a manner with
buffer or data set size, and does not converge to similar
throughput for most buffer sizes. The performance using
8 KB, 4 KB, and 2 KB buffers is particularly interesting.
As with the client-server pattern, the 4 KB buffer size
provides the best performance, but 8 KB buffers do
substantially better for small data sets than for large ones.
This could easily be due to system-level buffering effects.

Determining why the throughput varies in these ways 
with data and buffer size will require gathering informa- 
tion from the operating system layer to see if the net- 
working protocols play a role, and gathering informa- 
tion from the ORB layer to see if there is an influence at 
that level. Section 4.3 illustrates how we might use the 
DSKI to gather protocol layer information, but discusses 
a simpler example. 


4.3 Using the DSKI 


This section briefly illustrates the use of the DSKI to 
gather performance information from the operating sys- 
tem during a test using the simple client-server pattern 
discussed in Section 4.1. As discussed in Section 3, the
DSKI creates a stream of time stamped records for each 
occurrence of a predefined event in the operating sys- 
tem kernel. The set of event records produced can be 
post-processed to calculate the time spent in providing 
different types of system services. In this case, the time 
spent in various portions of the TCP/IP stack can be cal- 
culated because we have defined a set of events capable 
of tracing the progress of packets through the TCP/IP 
stack. Figure 13 presents the NetSpec PMO script im- 
plementing the experiment. 


cluster {
  corba testbed2 {
    NameOfORB = omniORB2;
    TypeOftest = rrstring;
    TestParams = (
      numsamples = 1, minsize = 1,
      maxsize = 2, multiples = 2,
      predelay = 3, postdelay = 3,
      duration = 30 );
    protocol = iiop;
    objname = omni-receiver;
    role = server;
    relations = client(omni-sender);
    criteria = throughput;
    qos = normal;
    own = testbed2 (interface = eth,
                    port = 41777);
  }

  corba testbed1 {
    NameOfORB = omniORB2;
    TypeOftest = rrstring;
    TestParams = (
      numsamples = 1, minsize = 1,
      maxsize = 2, multiples = 2,
      predelay = 3, postdelay = 3,
      duration = 30 );
    protocol = iiop;
    objname = omni-sender;
    role = client;
    relations = server(omni-receiver);
    criteria = throughput;
    qos = normal;
    own = testbed1 (interface = eth);
  }

  dstream testbed1 {
    type = active (numevents = 100, port=40778,
                   duration=30);
    ds_tcpip = all;
  }

  dstream testbed2 {
    type = active (numevents = 100, port=40778,
                   duration=30);
    ds_tcpip_read = all;
  }
}


Figure 13. PMO NetSpec Script using DSKI 


Figure 14 presents the packet flow in and out of the
Sender and Receiver hosts, as well as the performance
figures from the kernel obtained using the DSKI. The
socket, TCP, IP, and Ethernet layers are numbered 1
through 4 on the sending host, and 5 through 8 from the
bottom up on the receiver host. The path of a packet sent
from the sender to the receiver in the diagram would thus
flow through layers in numerical order. The bar graph in
Figure 14 shows the time in microseconds spent by a
packet in each layer, by number.


Figure 14. Time trace in the Operating System Layers Obtained Using DSKI

This experiment demonstrates that the DSKI provides
the ability to gather fine-grained, time-stamped events at
the operating system level. Note that the Y-axis in
Figure 14 is a logarithmic scale, showing that we were able
to monitor time intervals ranging from 10 microseconds
to 10 milliseconds. It also illustrates an important feature
of the DSKI which we call tagging. Each event in
the operating system can include a context-dependent
tag when it is logged. In the case of tracing TCP/IP
performance we used the port and sequence numbers to
uniquely identify each packet as it moves among protocol
layers on each machine.
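
The post-processing step can be sketched as follows. This is only an illustration under stated assumptions, not the DSKI tool chain: the record layout (timestamp in microseconds, layer number, port, sequence number) and the function name per_layer_time are hypothetical. The sketch groups records by their (port, sequence number) tag and treats the time spent in a layer as the interval between a packet's event in that layer and its next event.

```python
from collections import defaultdict


def per_layer_time(events):
    """Average time in microseconds that tagged packets spend in each layer.

    events: iterable of (timestamp_us, layer, port, seq) tuples. This layout
    is a hypothetical stand-in for the DSKI record format; (port, seq) plays
    the role of the tag that identifies one packet.
    """
    by_packet = defaultdict(list)
    for timestamp, layer, port, seq in events:
        by_packet[(port, seq)].append((timestamp, layer))

    totals = defaultdict(float)
    counts = defaultdict(int)
    for records in by_packet.values():
        records.sort()  # order one packet's events by timestamp
        # Time in a layer = gap between this event and the packet's next event.
        for (t0, layer), (t1, _next_layer) in zip(records, records[1:]):
            totals[layer] += t1 - t0
            counts[layer] += 1
    return {layer: totals[layer] / counts[layer] for layer in totals}
```

The last layer a packet reaches has no following event, so it does not appear in the result; a real trace would need an explicit exit event for that layer.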

A similar approach would be used to investigate how 
buffer size affects throughput under the proxy pattern. 
DSKI events could be used to monitor when and how 
packets are transmitted, combined, or broken into fragments,
and what size system buffers are used to hold
them. Tagging would be used to determine the progress 
of each packet, and would tell us if the buffer size influ- 
enced protocol behavior. Monitoring at the system level 
might also tell us something about the middleware, even 
without explicit instrumentation. If, for example, buffers 
of one size were given to the CORBA layer, but buffers 
of a different size were passed onto the system layer we 
would know that the middleware was manipulating the 
data buffers. 

The DSKI can thus be used to investigate a wide 
range of interactions between the operating system and 
software layers using its services. This is by no means 
a simple endeavor, since it requires access to the sys- 
tem source code and sufficient knowledge of the system 
to enable the investigator to define a reasonable set of 
events, and then to interpret the results. However, for 
the investigator willing to learn how to use it well, it 
can provide an important new source of detailed infor- 
mation from the operating system layer which plays an 
important role in determining what aspects of endsystem 
architecture limit application performance under various 
sets of execution conditions. 


5 Conclusions and Future Work 


The performance of CORBA-based applications implemented
as sets of objects is greatly influenced
by the application context and by the performance of
the ORB endsystem. Application developers need to
evaluate how candidate application object architectures
will perform within heterogeneous computing environments,
but a lack of standard and user-extendable performance
benchmark suites exercising all aspects of the
ORB endsystem under realistic application scenarios
makes this difficult. This paper introduced the Performance
Pattern Language and the Performance Measurement
Object, which address these problems by providing,
under NetSpec control, an automated script-based
framework within which extensive ORB endsystem performance
benchmarks may be efficiently described and
automatically executed.

The tools described are implemented, and the viabil- 
ity of the framework they provide has been demonstrated 
by implementation of small but non-trivial sets of perfor- 
mance evaluation scripts. The examples presented show 
that the full range of evaluation information can be gath- 
ered and a rich set of performance scenarios examined. 
The automated nature of the script driven framework is 
also important because it makes it possible to describe 
and conduct a large set of evaluation experiments covering
an adequately diverse and detailed set of scenarios
and performance metrics. 


Performance evaluation of CORBA based distributed 
applications, and of candidate object architectures, is 
an extremely important and difficult problem. Current 
benchmarking and testing methods are not as compre- 
hensive as they might be because the scale and complex- 
ity required is daunting. The tools described here make a 
significant increase in the scale, complexity, and level of 
detail of performance evaluation studies possible, thus 
significantly advancing the state of the art. 

Our future work will include creation of new test 
types and performance patterns. We are particularly in- 
terested in extending this approach to testing to include 
execution of applications under real-time constraints. 
We will use this set of tests to drive an investigation 
of what kinds of system support can be used to im- 
prove real-time performance of ORB based applications. 
We will concentrate on a time constrained event service 
and integration of operating system scheduling, I/O, and 
ORB level operations to improve time constrained com- 
munication among objects. 


Availability 


NetSpec and many daemons developed for various 
types of performance evaluation are publicly available. 
The PMO and the PPL have been developed with sup- 
port from Sprint, and are not yet publicly available, but 
should be soon. The work described here is intended
as a contribution to the CORBA community, and the tools are
intended for full availability on the WWW. For further
details check:


www.ittc.ukans.edu/~niehaus/research.html
Abstract


The CO2P3S parallel programming system uses design
patterns and object-oriented programming to reduce the 
complexities of parallel programming. The system gen- 
erates correct frameworks from pattern template spec- 
ifications and provides a layered programming model 
to address both the problems of correctness and open- 
ness. This paper describes the highest level of abstrac- 
tion in CO2P3S, using two example programs to demon- 
strate the programming model and the supported pat- 
terns. Further, we introduce phased parallel design pat- 
terns, a new class of patterns that allow temporal phase 
relationships in a parallel program to be specified, and 
provide two patterns in this class. Our results show that 
the frameworks can be used to quickly implement par- 
allel programs, reusing sequential code where possible. 
The resulting parallel programs provide substantial per- 
formance gains over their sequential counterparts. 


1 Introduction 


Parallel programming offers potentially substantial per- 
formance benefits to computationally intensive applica- 
tions. Using additional processors can increase the com- 
putational power that can be applied to large problems. 
Unfortunately, it is difficult to use this increased compu- 
tational power effectively as parallel programming intro- 
duces new complexities to normal sequential program- 
ming. The programmer must create and coordinate con- 
currency. Synchronization may be necessary to ensure 
data is used consistently and to produce correct results. 
Programs may be nondeterministic, hampering debug- 
ging by making it difficult to reproduce error conditions. 


We need development tools and programming tech- 
niques to reduce this added programming complexity. 
One such tool is a parallel programming system (PPS). 


A PPS can deal with complexity in several ways. It 
can provide a programming model that removes some 
of the added complexity. It can also provide a com- 
plete tool set for developing, debugging, and tuning pro- 
grams. However, it is likely that there will still be some 
additional complexity the user must contend with when 
writing parallel programs, even with a PPS. We can ease 
this complexity by using new programming techniques, 
such as design patterns and object-oriented program- 
ming. Design patterns can help by documenting work- 
ing design solutions that can be applied in a variety of 
contexts. Object-oriented programming has proven suc- 
cessful at reducing the software effort in sequential pro- 
gramming through the use of techniques such as encap- 
sulation and code reuse. We want to apply these benefits 
to the more complex domain of parallel programming. 


The CO2P3S parallel programming system supports
pattern-based parallel program development through
framework generation and multiple layers of abstraction
[8]. The system can be used to parallelize existing sequential
code or write new, explicitly parallel programs.
This system automates the use of a selected set of patterns
through pattern templates, an intermediary form
between a pattern and a framework. These pattern templates
represent a pattern where some of the design alternatives
are fixed and others are left as user-specified
parameters. Once the template has been fully specified,
CO2P3S generates a framework that implements the pattern
in the context of the fixed and user-supplied design
parameters. Within this framework, we introduce sequential
hook methods that the user can implement to
insert application-specific functionality. In the CO2P3S
programming model, the higher levels of abstraction emphasize
correctness and reduce the probability of programmer
errors by providing the parallel structure and
synchronization in the framework such that they cannot
be modified by the user. The lower layers emphasize
openness [14], gradually exposing low-level implementation
details and introducing more opportunities for performance
tuning. The user can work at an appropriate
level of abstraction based on what is being tuned.


The key to reducing programmer errors is the decompo- 
sition of the generated framework into parallel and se- 
quential portions. The framework implements both the 
parallel structure of the program and the synchronization 
for the execution of the hook methods. Neither of these 
attributes of the program can be modified at the high- 
est level of abstraction. This decomposition allows the 
hook methods to be implemented as sequential methods, 
so users can concentrate on implementing applications
rather than worrying about the details of the parallelism. 


We introduce phased parallel design patterns in this pa- 
per. Phased patterns are unique in that they express 
a temporal relationship between different phases of a 
parallel program. Although all patterns have temporal 
aspects (such as specific sequences of method invoca- 
tions), the intent of phased patterns is to deal with chang- 
ing concurrency requirements for different phases of a 
parallel program. We introduce two such patterns, the 
Method Sequence and the Distributor, both of which are 
new. The Method Sequence can be used to implement 
phased algorithms, explicitly differentiating between the 
different phases of a parallel algorithm. This pattern rec- 
ognizes that efficiently parallelizing a large program will 
likely require the application of several parallel design 
patterns. The Distributor pattern allows the user to se- 
lectively parallelize a subset of methods on an object, 
acknowledging that not all operations may have suffi- 
cient granularity for parallel execution. 


We also introduce a structural pattern, the Two-Dimensional
Mesh. The Mesh was written in an object-oriented
fashion to fit within the CO2P3S system.


To demonstrate the use of the CO2P3S system, we
present the development of two example programs. Our
first example, reaction-diffusion texture generation [17],
uses the Mesh pattern to simulate the reaction and diffusion
of two chemicals over a two-dimensional surface.
The second program implements the parallel sorting by
regular sampling algorithm [13] using the Method Sequence
and Distributor patterns. Both programs are implemented
using the facilities provided at the highest
level of abstraction in CO2P3S to demonstrate the utility
of our patterns and of the frameworks we
generate to support our pattern templates.


The research contributions of this paper are: 


• A layered parallel programming model that
  presents several different programming abstractions
  to the user. Each layer emphasizes different
  concerns, starting with program correctness at the
  highest level of abstraction, and providing openness
  and performance tuning at lower levels.


• The generation of a correct parallel framework
  from a pattern template specification, coupled with
  correctness guarantees by preventing the user from
  modifying the structure of the framework at the
  highest level of abstraction.


• An easy-to-use programming model that allows
  users to implement a parallel program by writing
  a small amount of sequential code to reuse their existing
  application code.


• An object-oriented, pattern-based tool for parallel
  programming. We also introduce a new type of
  parallel design pattern, called phased patterns, to
  express time-related aspects of a parallel program.


• A demonstration of the benefits of the CO2P3S system
  using two example programs. These examples
  illustrate the use of the highest level of abstraction
  in the CO2P3S model and demonstrate the benefits
  of using a high-level tool for parallel programming.


2 Overview of the CO2P3S System


This section presents a brief overview of the CO2P3S 
system. A more detailed description can be found in [8]. 


The CO2P3S parallel programming system provides 
three levels of abstraction that can be used in the de- 
velopment of parallel programs. These abstractions pro- 
vide a programming model that allows the programs to 
be tuned incrementally, allowing the user to develop par- 
allel programs where performance is more directly com- 
mensurate with effort. These abstractions, from highest 
to lowest, are: 


Patterns Layer The user selects a parallel design pattern
template from a palette of supported templates.
These templates represent a partially specified design
pattern, where some of the design tradeoffs
have been fixed. Each pattern template has several
parameters that must be supplied before the
template can be instantiated, allowing the user to
specialize the framework implementing the pattern
for its intended use. Instantiating the pattern template
generates code that forms a framework for
the template. The code consists of one or more
abstract classes that implement the parallel structure
of the pattern template together with concrete
subclasses, as well as any required collaborator
classes. The framework code is customized in two
ways: application-specific pattern template parameters
and user-supplied implementations of specific
sequential hook methods in the concrete subclasses
(using the Template Method design pattern [4]).
Since the user cannot modify the parallel structure
at this layer, parallel correctness is ensured. A complete
program consists of either a single framework
or several frameworks composed together.


Intermediate Code Layer This layer provides a high-level,
object-oriented, explicitly parallel programming
language, a superset of an existing OO language.
The user manipulates both parallel structural
code and application-specific code using this
intermediate language.


Native Code Layer The intermediate language is transformed
into code for a native object-oriented language
(such as Java or C++). This code provides all
libraries used to implement the intermediate code 
from the previous layer. The user is free to use the 
provided libraries and language facilities to modify 
the program in any way. 


Users can move down through the different abstractions, 
selecting a suitable layer based on the desired perfor- 
mance of their applications or on how comfortable they 
are with a given abstraction. 


Several critical aspects of the framework are demonstrated
by the use of hook methods for introducing
application-specific functionality. First, the parallel
structural code cannot be modified in the Patterns Layer,
which allows us to make correctness guarantees about
the parallel structure of the program. Second, it allows
users to concentrate on implementing their applications
without worrying about the structure of the parallelism.
Third, because the structural code provides
proper synchronization around hook method invocations,
the user can write sequential code without having
to take into account the parallelism provided by the
framework. Lastly, we provide suitable default implementations
of the hook methods in the abstract classes of
the framework. These default methods permit the framework
to be compiled and executed immediately after it is
generated, without implementing any of the hook methods.
The program will execute with a simple default behaviour.
This provides users with a correct implementation
of the structure of the pattern before they begin
adding the hook methods and tuning the program.
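
The division of labour between structural code and hook methods can be illustrated with a small sketch. It is not code generated by CO2P3S; the class names AbstractWorker and SummingWorker, the method process, and the single coarse lock are assumptions chosen to keep the example short. The abstract class owns the threads and the synchronization, the hook has a do-nothing default so the framework runs as generated, and the subclass contains only sequential code.

```python
import threading


class AbstractWorker:
    """Framework side: owns the parallel structure and the locking."""

    def __init__(self):
        self._lock = threading.Lock()
        self.total = 0

    def run(self, items, num_threads=4):
        # Structural code: at the highest level the user never edits this.
        chunks = [items[i::num_threads] for i in range(num_threads)]
        threads = [threading.Thread(target=self._work, args=(chunk,))
                   for chunk in chunks]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
        return self.total

    def _work(self, chunk):
        for item in chunk:
            with self._lock:  # synchronization wrapped around the hook call
                self.process(item)

    def process(self, item):
        """Hook method. The default does nothing, so the framework runs as is."""


class SummingWorker(AbstractWorker):
    """User side: plain sequential code, with no locks or threads in sight."""

    def process(self, item):
        self.total += item
```

Running AbstractWorker().run(...) exercises the parallel structure with the default behaviour, and SummingWorker overrides only the hook. The single lock serializes every hook call, which is deliberately pessimistic; a generated framework would be expected to synchronize more selectively.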


The CO2P3S system currently supports several parallel 
design patterns through its templates, which also use a 
group of sequential design patterns. These patterns are 
written specifically for solving design problems in the 
parallel programming domain. The patterns used in this 
paper are: 


Method Sequence This new pattern supports the creation
of phased algorithms by invoking an ordered
sequence of methods on a Facade [4] object.


Distributor This new pattern supports data-parallel
style computations by forwarding a method from
a parent object to a fixed number of child objects,
all executing in parallel.


Two-Dimensional Mesh This pattern supports
iterative computations for a rectangular two-dimensional
mesh, where a surface is decomposed
into a set of regular, rectangular partitions.


A more detailed description of these patterns, along with 
the other patterns supported by CO2P3S, can be found in 
our pattern catalogue [7]. 
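
As a rough illustration of how the two phased patterns fit together, the sketch below forwards a method from a parent to a fixed set of children that run in parallel, and drives the parent through an ordered list of phases. It is an assumption-laden sketch rather than the CO2P3S framework code: the names Distributor, Chunk, distribute, and method_sequence are hypothetical, and a thread pool stands in for whatever concurrency mechanism the generated framework uses.

```python
from concurrent.futures import ThreadPoolExecutor


class Distributor:
    """Forwards a method call from a parent to a fixed number of children."""

    def __init__(self, children):
        self.children = list(children)
        self._pool = ThreadPoolExecutor(max_workers=len(self.children))

    def distribute(self, method_name, *args):
        # Submit the same method on every child, then wait for all of them.
        futures = [self._pool.submit(getattr(child, method_name), *args)
                   for child in self.children]
        return [future.result() for future in futures]

    def close(self):
        self._pool.shutdown()


class Chunk:
    """A child object holding one partition of the data (hypothetical)."""

    def __init__(self, data):
        self.data = list(data)

    def sort_local(self):
        self.data.sort()
        return self.data[0]

    def count_below(self, pivot):
        return sum(1 for value in self.data if value < pivot)


def method_sequence(facade, phases):
    """Run the phases in order; each phase finishes before the next starts."""
    return [phase(facade) for phase in phases]
```

Each phase here is a callable applied to the facade object, so a phase may be sequential or may call distribute to run in parallel, which is the sense in which the concurrency can change from phase to phase.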


The context of our patterns is the architecture and pro- 
gramming model of CO2P3S. Therefore, we have made 
several changes to the structure of the basic design 
pattern documentation [4]. Since our patterns are for 
the parallel programming domain, the pattern descrip- 
tion includes concurrency-related specifications, such as 
synchronization and the creation of concurrent activity. 
From the pattern, we produce a CO2P3S pattern template 
and an example framework, which we also describe in 
the pattern document. While these two sections are not 
strictly pattern specifications, they illustrate the use of 
the pattern while documenting the CO2P3S templates 
and frameworks. So while we have added CO2P3S-specific
sections to our pattern descriptions, we have
preserved the instructional nature of design patterns. 


We note that our patterns still represent abstract design 
solutions. From the abstract pattern, we fix some of the 
design tradeoffs and allow the user to specify the remain- 
ing tradeoffs using parameters. This intermediate form, 
which is less abstract than a pattern but less concrete 
than a framework, is called a pattern template. We use 
a fully specified pattern template to generate parametrically
related families of frameworks, with each framework
in the family implementing the same basic pattern
structure but specialized with the user-supplied parameters.
However, the pattern templates are a design
construct used to specify a framework; the templates
neither contain nor provide code themselves. The generated
frameworks provide reusable and extendible code implementing
a specific version of the pattern. These frameworks
take into account our decomposition of a program
into its sequential and parallel aspects. The ability of the
user to modify a framework is dictated by the level of
abstraction that the user is using.


In creating the pattern templates, we try to fix as few of 
the design tradeoffs as possible. However, in fixing any 
part of the design, we will create situations in which it 
will be necessary to modify certain elements of a pro- 
gram to fit within the limitations we impose. For the 
example programs that follow, we discuss the design in 
terms of the patterns, and discuss the implementation in 
terms of the pattern templates and the frameworks. The 
implementation of the programs may need to be modi- 
fied from the initial design to account for the fixed design 
tradeoffs in the pattern templates and frameworks. 


3 Example Applications 


In this section, we detail the design and implementa- 
tion of two example programs. These examples demon- 
strate the applicability of our patterns to application de- 
sign, and show how the pattern templates and generated 
frameworks can be used to implement the programs. The 
first example uses a reaction-diffusion texture generation
program to show how we have reworked the Mesh
pattern in an object-oriented fashion. The second ex- 
ample, parallel sorting by regular sampling, uses the 
Method Sequence and Distributor patterns. We also use 
this example to highlight the temporal nature of these 
two patterns and to show the composition of CO2P3S 
frameworks. The implementation of these two examples 
demonstrates that the Patterns Layer of CO2P3S can be 
used to write parallel programs with good performance. 
It also shows the benefits of a high-level object-oriented 
tool for parallel programming. 


3.1 Reaction-Diffusion Texture Generation


Reaction-diffusion texture generation [17] can be described
as two interacting Laplace equations that simulate
the reaction and diffusion of two chemicals (called
morphogens) over a two-dimensional surface. This simulation,
started with random concentrations of each morphogen
on the surface, can produce texture maps that
approximate zebra stripes. The problem uses Gauss-Jacobian
iterations (each iteration uses the results from
the previous iteration to compute the new values, without
any successive overrelaxation) and is solved using
straightforward convolution. The simulation typically
executes until the total change in morphogen concentration
over the surface falls below some threshold. We use
a fully toroidal mesh, allowing the morphogens to wrap
around the edges of the surface. The toroidal boundary
conditions ensure that the resulting texture can be tiled
on a display without any noticeable edges between tiles.
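
The iteration scheme can be sketched sequentially as follows. This is a simplified stand-in, not the reaction-diffusion equations of [17]: it diffuses a single quantity with a four-neighbour stencil, and the names jacobi_step and simulate as well as the rate and threshold values are assumptions. It does show the three properties described above: every new value is computed from the previous iteration only, indices wrap around the edges (fully toroidal), and the loop stops once the total change falls below a threshold.

```python
def jacobi_step(grid, rate=0.1):
    """One sweep in which every new value depends only on the old grid."""
    rows, cols = len(grid), len(grid[0])
    new = [[0.0] * cols for _ in range(rows)]
    change = 0.0
    for r in range(rows):
        for c in range(cols):
            # Modular indexing wraps around the edges (fully toroidal).
            neighbours = (grid[(r - 1) % rows][c] + grid[(r + 1) % rows][c] +
                          grid[r][(c - 1) % cols] + grid[r][(c + 1) % cols])
            new[r][c] = grid[r][c] + rate * (neighbours - 4.0 * grid[r][c])
            change += abs(new[r][c] - grid[r][c])
    return new, change


def simulate(grid, threshold=1e-6, max_iters=10000):
    """Iterate until the total change over the surface is below threshold."""
    iterations = 0
    while iterations < max_iters:
        grid, change = jacobi_step(grid)
        iterations += 1
        if change < threshold:
            break
    return grid, iterations
```

Because the new grid is written separately from the old one, the cells of one sweep are independent of each other, which is what makes the decomposition into regions discussed in the next section possible.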


3.1.1 Parallel Implementation 


An obvious approach to this problem is to decompose 
the two-dimensional surface into regular rectangular re- 
gions and to work on these regions in parallel. Our so- 
lution must be more complex because we cannot evalu- 
ate each region in isolation. Each point on the surface 
needs the concentration values from its neighbours to 
calculate its new value, so points on the edge of a region 
need data from adjoining regions. Thus, we require an 
exchange of region boundaries between iterations. Further,
Gauss-Jacobian iterations introduce dependencies
between iterations that need to be addressed by our parallel
implementation of this problem.


3.1.2 Implementation in CO2P3S


Design and Pattern Specification The first step in im- 
plementing a CO2P3S program is to analyze the prob- 
lem and select the appropriate design pattern. This pro- 
cess still represents the bottleneck in the design of any 
program. Given the requirements of our problem, the 
Mesh pattern is a good choice for several reasons. The 
problem is an iterative algorithm executed over a two-dimensional
surface. Further, our approach is to decompose
the surface into regular rectangular regions, a decomposition
that is automatically handled by the framework
for the Mesh pattern template.


The Mesh pattern consists of two types of objects, a collector
object and a group of mesh objects. The structure
of the Mesh is given in Figure 1. The collector object
is responsible for creating the mesh objects, distributing
the input state over the mesh objects, controlling the execution
of the mesh objects, and collecting the results of
the computation. The mesh objects implement an iterative
mesh computation, a loop that exchanges boundaries
with neighbouring mesh elements and computes
the new values for its local mesh elements. The iterations
continue until all the mesh objects have finished
computing, which in our example is when all of the morphogen
concentrations in all of the mesh objects have stabilized.


Figure 1: An object diagram for the structure of the Mesh pattern. The arcs indicate object references. The
Collector object has references to all of the instances of MeshObject, but the arcs are omitted for clarity.


Figure 2: The method invocations in the Mesh implementation of the reaction-diffusion problem. The methods are
executed in the order in which they are numbered. Calls 4 through 10 are the main loop of the mesh computation and
are repeated until the computation has completed.


This pattern requires synchronization because of depen- 
dencies between iterations. Specifically, we need to en- 
sure that the boundary elements for each mesh element 
have all been computed before exchanging them, to pre- 
vent a neighbour from using incorrect data in its next 
iteration. We also require synchronization for determin- 
ing the termination condition, since all of the mesh ob- 
jects must cooperate to decide if they have finished. This 
synchronization is necessarily pessimistic to handle the 
general case. Individual programs using this pattern may 
remove any unnecessary synchronization. 
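
The two synchronization points can be sketched with barriers. The example below is an assumption, not the generated Mesh framework: it uses a one-dimensional ring with one value per mesh object instead of a two-dimensional mesh, and the names run_ring and mesh_object are hypothetical. The first barrier ensures that every new value has been written before anyone reads it or decides on termination; the second ensures that all threads have made the same termination decision before the shared change counters are overwritten.

```python
import threading


def run_ring(values, threshold=1e-9, max_iters=1000):
    """Relax a ring of values in parallel, one thread per mesh object."""
    n = len(values)
    bufs = [list(values), [0.0] * n]  # double buffer: read one, write the other
    changes = [0.0] * n
    iterations = [0] * n
    barrier = threading.Barrier(n)

    def mesh_object(i):
        for k in range(max_iters):
            src, dst = bufs[k % 2], bufs[(k + 1) % 2]
            dst[i] = (src[(i - 1) % n] + src[i] + src[(i + 1) % n]) / 3.0
            changes[i] = abs(dst[i] - src[i])
            barrier.wait()  # all new values and changes are now written
            done = sum(changes) < threshold  # same result in every thread
            iterations[i] = k + 1
            barrier.wait()  # nobody overwrites `changes` before all have read
            if done:
                break

    threads = [threading.Thread(target=mesh_object, args=(i,))
               for i in range(n)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return bufs[iterations[0] % 2], iterations[0]
```

Two barriers per iteration are pessimistic in the sense used above; an application that knows more about its data dependencies could relax them.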


The Mesh pattern is not specific to the reaction-diffusion
example. It can be used to implement other finite element
calculations and image processing applications.


Once the pattern has been selected, the user selects
the corresponding pattern template and fills in
its parameters. For all pattern templates, the names
of both the abstract and concrete classes for each
class in the resulting framework are required. In
CO2P3S, the user only specifies the concrete class
names; the corresponding abstract class names are
formed by prepending “Abstract”. For our Mesh example,
we specify RDCollector for the collector class and
RDSimulation for the mesh class, which also creates
the abstract classes AbstractRDCollector
and AbstractRDSimulation. We further specify
the type of the two-dimensional array that will
be distributed over the mesh objects, which is
MorphogenPair for the reaction-diffusion example.
Finally, we specify the boundary conditions of the mesh,
which we set to fully toroidal (where each mesh object
has all four neighbours by wrapping around the
edges of the mesh, as shown in Figure 1). We can
select other boundary conditions (horizontal-toroidal,
vertical-toroidal, and non-toroidal); we will see the
effects of different conditions later in this section. The
dimensions of the mesh are specified via constructor
arguments to the framework. The input data is automatically
block-distributed over the mesh objects, based on
the dimensions of the input data and the dimensions of
the mesh.
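

As an illustration of the block distribution, the following sketch computes which part of a two-dimensional input array a given mesh object would own. The class and method names are hypothetical, int stands in for MorphogenPair, and the treatment of dimensions that do not divide evenly (the first blocks receive one extra row or column) is an assumption; the generated framework may distribute remainders differently.

```java
import java.util.Arrays;

/** Hypothetical helper showing one way to block-distribute a 2-D array over a mesh. */
class BlockDistributionSketch {
    /** Returns {start, endExclusive} of block `index` when `length` items are split into `parts` blocks. */
    static int[] blockRange(int length, int parts, int index) {
        int base = length / parts;
        int extra = length % parts; // assumption: the first `extra` blocks get one more item
        int start = index * base + Math.min(index, extra);
        int size = base + (index < extra ? 1 : 0);
        return new int[] {start, start + size};
    }

    /** Copies the sub-array owned by the mesh object at (meshRow, meshCol). */
    static int[][] localBlock(int[][] data, int meshRows, int meshCols, int meshRow, int meshCol) {
        int[] rows = blockRange(data.length, meshRows, meshRow);
        int[] cols = blockRange(data[0].length, meshCols, meshCol);
        int[][] block = new int[rows[1] - rows[0]][];
        for (int r = rows[0]; r < rows[1]; r++) {
            block[r - rows[0]] = Arrays.copyOfRange(data[r], cols[0], cols[1]);
        }
        return block;
    }
}
```

For a 4 x 4 input on a 2 x 2 mesh, the mesh object in the second row and first column would own the lower-left 2 x 2 quadrant of the input.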


Using the Framework From the pattern template and its
parameters, the CO2P3S system generates a framework
implementing this specific instance of the Mesh pattern.
The framework consists of the four classes given above 
with some additional collaborator classes. The sequence 
of method calls for the framework is given in Figure 2. 


Once the framework is generated, the user can implement
hook methods to insert application-specific functionality
at key points in the structure of the framework.
The selection of hook methods is important since we en- 
force program correctness at the Patterns Layer by not 
allowing the user to modify the structural code of the 
framework. The user implements the hook methods in 
the concrete class by overriding the default method pro- 
vided in the abstract superclass in the framework. If the 
hook method is not needed in the application, the default 
implementation can be inherited. 
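

The division between structural code and hook methods can be sketched as follows. This is a simplified illustration, not the generated framework: the class names AbstractMeshSketch and CountdownMeshSketch are hypothetical, the boundary exchange and termination protocol are omitted, and declaring the structural method final is only one assumed way of preventing the user from changing it.

```java
/** Hypothetical abstract superclass: owns the structure, supplies default hooks. */
abstract class AbstractMeshSketch {
    // Structural code. It is final so that a concrete class cannot alter the control flow.
    public final void meshMethod() {
        outerPrefix();
        while (notDone()) {
            innerPrefix();
            operate();
        }
        outerSuffix();
    }

    // Default hook implementations; a concrete class overrides only what it needs.
    protected void outerPrefix() { }
    protected boolean notDone() { return false; }
    protected void innerPrefix() { }
    protected void operate() { }
    protected void outerSuffix() { }
}

/** Hypothetical concrete class: overrides three hooks and inherits the other defaults. */
class CountdownMeshSketch extends AbstractMeshSketch {
    int remaining = 3;
    final StringBuilder log = new StringBuilder();

    @Override
    protected boolean notDone() { return remaining > 0; }

    @Override
    protected void operate() {
        remaining--;
        log.append("op;");
    }

    @Override
    protected void outerSuffix() { log.append("done"); }
}
```

Calling meshMethod() on a CountdownMeshSketch runs the inherited loop three times and then the overridden outerSuffix(); the hooks that were not overridden fall back to the empty defaults.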


To demonstrate the use of the hook methods, we show 
the main execution loop of the Mesh framework, as gen- 
erated by CO2P3S, in Figure 3. The hook methods are 
indicated in bold italics. There are hook methods for 
both the collector and mesh objects in the Mesh frame- 
work. For the collector, the only hook method is: 


reduce() This method, invoked after the mesh com- 
putation has finished, allows the user to perform a 
reduction on the results. By default, this method re- 
turns the input array updated with the results of the 
mesh computation. 


The hook methods for the mesh objects are: 


outerPrefix() This method is invoked before the 
mesh computation is started. It can be used for ini- 
tializing the mesh object. By default, this method 
does nothing. 


notDone() This method is used to determine if
the mesh object has finished its local computation.
Note that the computation finishes only
when all mesh objects have finished. By default,
this method returns false, indicating that
the mesh object has finished its computation.
This method is not directly invoked from the
meshMethod() method. It is invoked indirectly
from the notDoneCondition() method,
which uses the result of this hook method in the
setTerminationFlag() method to set the
termination status of the current mesh object and
then uses checkTerminationFlags() to calculate
the global termination condition.


5th USENIX Conference on Object-Oriented Technologies and Systems (COOTS '99)


innerPrefix() This method is invoked first in the 
mesh computation loop, before the boundary ex- 
change. It can be used for any precomputations re- 
quired in the loop. By default, this method does 
nothing. 


The operation methods These methods implement the 
mesh computation. There are nine methods that 
may be used depending on the boundary condi- 
tions. These are described below. By default, these 
methods do nothing. They are invoked indirectly
from the operate() method. These methods replace
the innerSuffix() method.


outerSuffix() This method is invoked after the 
mesh computation has finished but before results 
are passed back to the collector object. It can be 
used for any post-processing or cleanup required 
by the mesh object. By default, this method does 
nothing. 


The implementation of the operate() method called
in the code from Figure 3 invokes a subset of the nine
operation methods given in Figure 4. The boundary
conditions and the position of the mesh object determine
which of the operation methods are used. For instance,
consider the two meshes in Figure 5. For the
fully toroidal mesh in Figure 5(a), there are no boundary
conditions. Thus, only the interiorNode() hook
method is invoked. For the horizontal-toroidal mesh in
Figure 5(b), there are three different cases, one for each
row. The mesh objects in the different rows, from top to
bottom, invoke the topEdge(), interiorNode(),
and bottomEdge() hook methods for the mesh operation.
The implementation of the operate() method
uses a Strategy pattern [4], where the strategy corresponds
to the selected boundary conditions. This strategy
is a collaborator class generated with the rest of the
framework. It is also responsible for setting the neighbours
of the mesh elements after they are created, using
the setNeighbours() method (from Figure 2).
At the Patterns Layer, the user does not modify this
class. Each mesh object automatically executes the correct
methods, depending on its location in the mesh.
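

The selection made by the boundary strategy can be sketched as a function from the boundary conditions and the position of a mesh object to the name of the operation method it invokes. The class BoundaryStrategySketch and its enum are hypothetical, and the reading of vertical-toroidal as wrapping top to bottom is an assumption by analogy with the horizontal-toroidal case described above. The sketch also assumes at least two rows and two columns wherever the mesh does not wrap.

```java
/** Hypothetical stand-in for the generated boundary strategy collaborator. */
class BoundaryStrategySketch {
    enum Boundary { FULLY_TOROIDAL, HORIZONTAL_TOROIDAL, VERTICAL_TOROIDAL, NON_TOROIDAL }

    /** Returns the name of the operation hook for the mesh object at (row, col). */
    static String operationFor(Boundary boundary, int row, int col, int rows, int cols) {
        boolean wrapsLeftRight = boundary == Boundary.FULLY_TOROIDAL
                || boundary == Boundary.HORIZONTAL_TOROIDAL;
        boolean wrapsUpDown = boundary == Boundary.FULLY_TOROIDAL
                || boundary == Boundary.VERTICAL_TOROIDAL;

        // A side counts as a boundary only if the mesh does not wrap in that direction.
        boolean top = !wrapsUpDown && row == 0;
        boolean bottom = !wrapsUpDown && row == rows - 1;
        boolean left = !wrapsLeftRight && col == 0;
        boolean right = !wrapsLeftRight && col == cols - 1;

        if (top) {
            return left ? "topLeftCorner" : right ? "topRightCorner" : "topEdge";
        }
        if (bottom) {
            return left ? "bottomLeftCorner" : right ? "bottomRightCorner" : "bottomEdge";
        }
        if (left) {
            return "leftEdge";
        }
        if (right) {
            return "rightEdge";
        }
        return "interiorNode";
    }
}
```

With these assumptions, every position in a fully toroidal mesh maps to interiorNode(), and the three rows of a 3 x 3 horizontal-toroidal mesh map to topEdge(), interiorNode(), and bottomEdge(), matching the two cases discussed above.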


Now we implement the reaction-diffusion texture generation
program using the generated Mesh framework.


public void meshMethod()
{
    this.outerPrefix() ;
    while(this.notDoneCondition()) {
        this.innerPrefix() ;
        MorphogenPair[][] leftState  = left.getState() ;
        MorphogenPair[][] rightState = right.getState() ;
        MorphogenPair[][] upState    = up.getState() ;
        MorphogenPair[][] downState  = down.getState() ;
        this.operate(leftState, rightState, upState, downState) ;
    } /* while */
    this.outerSuffix() ;
    this.getCollector().setResult(this.getState()) ;
} /* meshMethod */


Figure 3: The main execution loop of a mesh. Hook methods are shown in bold italics.


topLeftCorner(MorphogenPair[][] right, MorphogenPair[][] down) ;
topEdge(MorphogenPair[][] left, MorphogenPair[][] right,
        MorphogenPair[][] down) ;
topRightCorner(MorphogenPair[][] left, MorphogenPair[][] down) ;
leftEdge(MorphogenPair[][] right, MorphogenPair[][] up,
         MorphogenPair[][] down) ;
interiorNode(MorphogenPair[][] left, MorphogenPair[][] right,
             MorphogenPair[][] up, MorphogenPair[][] down) ;
rightEdge(MorphogenPair[][] left, MorphogenPair[][] up,
          MorphogenPair[][] down) ;
bottomLeftCorner(MorphogenPair[][] right, MorphogenPair[][] up) ;
bottomEdge(MorphogenPair[][] left, MorphogenPair[][] right,
           MorphogenPair[][] up) ;
bottomRightCorner(MorphogenPair[][] left, MorphogenPair[][] up) ;


Figure 4: The hook methods for the mesh operations.


(a) A fully toroidal mesh.
(b) A horizontal-toroidal mesh.


Figure 5: Example mesh structures.


First, we note that we do not need a reduction method,
as the result of the computation is the surface computed
by each region. Also, we do not require the
outerPrefix() or the outerSuffix() methods.
The innerPrefix() method is required because we
have chosen to keep two copies of each morphogen array,
a read copy for getting previous data and a write
copy for update during an iteration. This scheme uses
additional memory, but obviates the need to copy the
array in each iteration. Each iteration must alternate
between using the read and write copies, which is
accomplished by reversing the references to the arrays in
the innerPrefix() method. Given our fully toroidal
mesh, we only need to implement the mesh operation in
the interiorNode() hook method.
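

The read and write copies can be sketched as follows. The class DoubleBufferSketch is hypothetical, a one-dimensional array of double stands in for the two-dimensional MorphogenPair array, and the averaging operation is a placeholder for the real reaction-diffusion update. The sketch assumes that the initial data starts in the write copy, so that the first swap in innerPrefix() exposes it for reading.

```java
/** Hypothetical mesh object that alternates between a read copy and a write copy. */
class DoubleBufferSketch {
    private double[] read;  // values of the previous iteration
    private double[] write; // values produced by the current iteration

    DoubleBufferSketch(double[] initial) {
        // Assumption: initial data sits in the write copy until the first swap.
        this.write = initial.clone();
        this.read = new double[initial.length];
    }

    /** Hook: reverse the two references instead of copying the array. */
    void innerPrefix() {
        double[] tmp = read;
        read = write;
        write = tmp;
    }

    /** Hook: placeholder operation, each cell becomes the mean of its two ring neighbours. */
    void interiorNode() {
        int n = read.length;
        for (int i = 0; i < n; i++) {
            write[i] = (read[(i + n - 1) % n] + read[(i + 1) % n]) / 2.0;
        }
    }

    /** One pass of the loop body: swap first, then compute into the write copy. */
    void step() {
        innerPrefix();
        interiorNode();
    }

    double[] getState() {
        return write.clone();
    }
}
```

Each call to step() costs one reference swap regardless of the array size, at the price of holding two arrays in memory.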


The notDone() method checks the local mesh state 
for convergence. Each mesh object returns a Boolean 
flag indicating if it has finished its local computation, 
and these flags are used to determine if all of the mesh 
objects have finished. The pattern template fixes the 
flags as Booleans, which does not allow the global ter- 
mination conditions given in Section 3.1 to be imple- 
mented. Instead, our simulation ends when the change 
in morphogen concentration in each cell falls below a 
threshold. Although this restriction forced us to modify 
this program, it simplifies the pattern template specifica- 
tion and reduces the number of hook methods. This ter- 
mination behaviour can be modified at the Intermediate 
Code Layer if a global condition must be implemented.


After completing the specification and implementation
of the Mesh framework, the user must implement the
code to instantiate the objects and use the framework.
The Java code is given in Figure 6, where we use constants
for the width and height of the data and the mesh,
but these values could be obtained dynamically at run-time
from a file or from the user.


3.1.3 Evaluation


In this section, we evaluate the Patterns Layer of 
CO2P3S using the reaction-diffusion texture generator.
Our basis for evaluation is the amount of code written 
by the user to implement the parallel program and the 
run-time performance. These results are based on a Java 
implementation of the problem. 


In the following discussion, we do not include any com- 
ments in counts of lines of code. All method invocations 
and assignments are considered one line, and we count 
all method signatures (although the method signatures 
for the hook methods are generated for the user). 


The sequential version of the reaction-diffusion program
was 568 lines of Java code. The complete parallel
version, including the generated framework and collaborating
classes, came to 1143 lines. Of those 1143 lines,
the user wrote (coincidentally) 568 lines, just under half.
However, 516 lines of this code were taken directly from
the sequential version. This reused code consisted of the 
classes implementing the morphogens. This morphogen 
code had to be modified to use data obtained from the 
boundary exchange, whereas the sequential version only 
used local data. This modification required one method 
to be removed from the sequential version and several 
methods added, adding a total of 52 lines of code to 
the application. The only code that could not be reused 
from the sequential version was the mainline program. 
In addition, the user was required to implement the hook 
methods in the Mesh framework. These methods were 
delegated to the morphogen classes and required only a 
single line of code each. 


We note that this case is almost optimal; the structure of 
the simulation was parallelized without modifying the 
computation. Also, the structure of the parallel program 
is close to the sequential algorithm, which is not always 
the case. These characteristics allowed almost all of the 
sequential code to be reused in the parallel version. 


This program was executed using a native-threaded Java
implementation from SGI (Java Development Environment
3.1.1, using Java 1.1.6). The programs were compiled
with optimizations on and executed on an SGI
Origin 2000 with 42 195MHz R10000 processors and
10GB of RAM. The Java virtual machine was started
with 512MB of heap space. Results were collected for
programs executed with the just-in-time (JIT) compiler
enabled and then again with the JIT compiler disabled.
Disabling the JIT compiler effectively allows us to observe
how the frameworks behave on problems with increased
granularity. Speedups are based on wall-clock
time and are compared against a sequential implementation
of the same problem executed using a green threads
virtual machine. Note that the timings are only for the
mesh computation; neither initialization of the surface
nor output is included. The results are given in Table 1.


With the JIT enabled, the speedups for the program tail
off quickly. As we add more processors, the granularity
of the mesh computation loop falls and the synchronization
hampers performance. The non-JIT version shows
good speedups, scaling to 16 processors, showing the
effects of increased granularity.


From this example, we can see that our framework pro- 
motes the reuse of sequential code; almost all of the mor- 
phogen code from the sequential program was reused in 
the parallel version. This reuse allowed the parallel ver- 
sion of the problem to be implemented with only a few 
new lines of code (52 lines). The performance of the 
resulting parallel application is acceptable with the JIT 
enabled, although the granularity quickly falls. 


3.2 Parallel Sorting by Regular Sampling 


Parallel sorting by regular sampling (PSRS) is a paral- 
lel sorting algorithm that provides a good speedup over 
a broad range of parallel architectures [13]. This algo- 
rithm is explicitly parallel and has no direct sequential 
counterpart. Its strength lies in its load balancing strat- 
egy, which samples the data to generate pivot elements 
that evenly distribute the data to processors. 


public static void main(String[] argv)
{
    MorphogenPair[][] data = Main.initializeData(Main.dataWidth,
                                                 Main.dataHeight) ;
    RDCollector collector = new RDCollector(Main.meshWidth,
                                            Main.meshHeight, data,
                                            Main.dataWidth, Main.dataHeight) ;

    /* Start the execution of the simulation. */
    collector.Execute() ;

    /* Wait for and get the results. */
    data = (MorphogenPair[][]) collector.getResults() ;
} /* main */


Figure 6: The code that starts the reaction-diffusion simulation using the Mesh framework.


Problem size: 1024 x 1024

                Speedup (wall-clock time), one column per processor count
JIT disabled    1.96 (11150 sec)    B79 [sic] (5886 sec)    6.93 (3162 sec)    12.39 (1767 sec)
JIT enabled     1.61 (2519 sec)     2.88 (1413 sec)         4.58 (887 sec)     4.49 (906 sec)


Table 1: Speedups for the reaction-diffusion example. Wall clock times, rounded to the second, are also provided.


The algorithm consists of four phases, illustrated in Figure 7.
Each phase must finish before the next phase
starts. The phases, executed on p processors, are:


1. In parallel, divide the input array into p contiguous
lists and sort each list. Select p - 1 evenly spaced
sample elements from each sorted list.


2. Select a designated processor to sort the entire set
of sample elements. Then, choose p - 1 evenly
spaced pivot values from the sample set.


3. In parallel, partition each sorted list into p sublists
using the pivot values.


4. In parallel, merge the partitions and store the results
back into the array.
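

The four phases of PSRS can be traced in a small sequential sketch, in which ordinary loops stand in for the p processors. The class PsrsSketch is hypothetical and is not the CO2P3S implementation. The exact positions of the "evenly spaced" samples and pivots are assumptions, elements equal to a pivot are assumed to go to the lower sublist, and the sketch assumes that the input holds at least p elements.

```java
import java.util.Arrays;

/** Hypothetical, sequential walk through the four PSRS phases. */
class PsrsSketch {
    static int[] sort(int[] input, int p) {
        int n = input.length;
        if (p < 1 || n < p) {
            throw new IllegalArgumentException("need at least one element per processor");
        }

        // Phase 1 (parallel in PSRS): sort p contiguous lists, take p - 1 samples from each.
        int[][] blocks = new int[p][];
        int[] samples = new int[p * (p - 1)];
        int s = 0;
        for (int i = 0; i < p; i++) {
            blocks[i] = Arrays.copyOfRange(input, i * n / p, (i + 1) * n / p);
            Arrays.sort(blocks[i]);
            for (int k = 1; k < p; k++) {
                samples[s++] = blocks[i][k * blocks[i].length / p];
            }
        }

        // Phase 2 (one processor): sort the samples and choose p - 1 pivots.
        Arrays.sort(samples);
        int[] pivots = new int[p - 1];
        for (int k = 1; k < p; k++) {
            pivots[k - 1] = samples[k * (p - 1)];
        }

        // Phases 3 and 4 (parallel in PSRS): processor j takes sublist j of every
        // sorted list, merges them, and stores its run at its offset in the result.
        int[] result = new int[n];
        int offset = 0;
        for (int j = 0; j < p; j++) {
            int[] merged = new int[0];
            for (int[] block : blocks) {
                int from = (j == 0) ? 0 : upperBound(block, pivots[j - 1]);
                int to = (j == p - 1) ? block.length : upperBound(block, pivots[j]);
                merged = merge(merged, Arrays.copyOfRange(block, from, to));
            }
            System.arraycopy(merged, 0, result, offset, merged.length);
            offset += merged.length;
        }
        return result;
    }

    /** Index of the first element greater than key in a sorted array. */
    static int upperBound(int[] sorted, int key) {
        int lo = 0;
        int hi = sorted.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted[mid] <= key) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    /** Standard two-way merge of two sorted arrays. */
    static int[] merge(int[] a, int[] b) {
        int[] out = new int[a.length + b.length];
        int i = 0, j = 0, k = 0;
        while (i < a.length && j < b.length) {
            out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
        }
        while (i < a.length) {
            out[k++] = a[i++];
        }
        while (j < b.length) {
            out[k++] = b[j++];
        }
        return out;
    }
}
```

Because every element of sublist j lies between pivot j - 1 and pivot j, the merged runs can simply be written one after another to produce the sorted array.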


3.2.1 Parallel Implementation 


The parallelism in this problem is clearly specified in 
the algorithm from the previous section. We require a 
series of phases to be executed, where some of those 
phases use a set of processors computing in parallel. For 
the parallel phases, a fixed number of processors execute 
similar code on different portions of the input data. 


3.2.2 Phased Parallel Design Patterns 


An interesting aspect of the PSRS algorithm is that the 
parallelism changes in different phases. The first and 
third phases are similar. The second phase is sequential. 
Finally, the last phase consists of two subphases (iden- 
tified below), where each subphase has its own concur- 
rency requirements. We need to ensure that this tempo- 
ral relationship between the patterns can be expressed. 
In contrast, other parallel programming systems require 
the user to build a graph that is the union of all the possi- 
ble execution paths that are used and leave it to the user 
to ensure they are used correctly. Alternatively, the user
must use a pipeline solution, where each pipe stage
implements a phase (as in Enterprise [10]). However, the
real strength of a pipeline lies in algorithms where multiple
requests can be concurrently executing different operations
in different pipe stages. Further, a pipeline suggests
that a stream of data (or objects in an OO pipeline)
is being transformed by a sequence of operations. A
phased algorithm may transform its inputs or generate
other results, depending on the algorithm.


A further temporal aspect of this algorithm is when to 
use parallelism. Sometimes we would like to use the 
same group of objects for both parallel and sequential 
methods. For instance, some methods may not have 
enough granularity for parallel execution. Sometimes, 
as in the second phase of PSRS, we may need to execute 
a sequential operation on data contained in a parallel 
framework. This kind of control can be accommodated 
by adding sequential methods in the generated frame- 
work. These methods would use the object structure 
without the concurrency. In implementing these meth- 
ods, the user must ensure that they will not interfere with 
the execution of any concurrent method invocations. 


3.2.3 Implementation in CO2P3S


Design and Pattern Specification 


Method Sequence In the PSRS algorithm, the phases 
stand out as the first concern. The algorithm suggests 
four phases. We note, though, that the last phase con- 
tains a data dependency. We must ensure that the merg- 
ing is complete before we can begin writing results back 
to the original array. Otherwise, the merging phase can
read incorrect data. From this observation, we rewrite
the fourth phase of PSRS as two subphases:


4.1 Merge the partitions in a temporary buffer.


4.2 Store the results in the original array by copying the
temporary buffer.


Figure 7: An example of parallel sorting by regular sampling.


To implement a series of phases in a program, we can 
use the Method Sequence pattern. In our example, we 
have identified two different sets of phases so we apply 
this pattern twice. The first application of the Method 
Sequence pattern implements the four phases from the 
previous section. We apply the pattern again in the im- 
plementation of the last phase, to execute the phases 
identified above. Alternatively, we could rewrite the original
algorithm as five phases and apply the pattern once.
However, our solution is consistent with the original al- 
gorithm and also helps us demonstrate the composability 
of our frameworks in the next subsection. 


The Method Sequence pattern is a specialization of the 
Facade pattern [4] that adds ordering to the methods of
the Facade. It consists of two objects, a sequence object
and an instance of the Facade pattern [4]. The sequence
object holds an ordered list of method names to
be invoked on the Facade object. These methods have 
no arguments or return values. The Facade object sup- 
plies a single entry point to the objects that collaborate 
to execute the different phases. The Facade typically 
delegates its methods to the correct collaborating object, 
where these methods implement the different phases for 
the sequence object. Each phase is executed only af- 
ter the previous phase has finished. The Facade object 
is also responsible for keeping any partial results gener- 
ated by one phase and used in another (such as the pivots 
generated in the second phase of PSRS and used in the 
third phase). We include the Facade object for the gen- 
eral case of the Method Sequence pattern, where there 
may be different collaborating objects implementing dif- 
ferent phases of the computation. Without the Facade 
object, the sequence object would need to manage both 
the method list and the objects to which the methods are 
delegated, making the object more complicated.


The Method Sequence pattern has other uses beyond this
example. For instance, it is applicable to programs that
can be written as a series of phases, such as LU factorization
(a reduction phase and a back substitution phase).


After designing this part of the program, the user selects 
the Method Sequence pattern template and fills in the pa- 
rameters to instantiate the template. For this pattern tem- 
plate, the user specifies the names of the concrete classes 
for the sequence and Facade classes, and an ordered list 
of methods to be invoked on the Facade object. Again, 
the abstract superclasses for both classes have “Abstract” 
prepended to the concrete class names. 
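

A minimal sketch of the Method Sequence pattern is given below. The classes SequenceSketch and SorterFacadeSketch are hypothetical, and looking the methods up by name through Java reflection is an assumption made to mirror the "ordered list of method names"; a generated framework could equally emit direct calls. The phase bodies only record that they ran.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sequence object: runs parameterless Facade methods in a fixed order. */
class SequenceSketch {
    private final Object facade;
    private final List<String> methodNames;

    SequenceSketch(Object facade, List<String> methodNames) {
        this.facade = facade;
        this.methodNames = methodNames;
    }

    // Hook methods of the sequence object; both do nothing by default.
    protected void prefix() { }
    protected void suffix() { }

    /** Structural code: each phase starts only after the previous one has returned. */
    public final void executeSequence() {
        prefix();
        for (String name : methodNames) {
            try {
                facade.getClass().getMethod(name).invoke(facade);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("cannot run phase " + name, e);
            }
        }
        suffix();
    }
}

/** Hypothetical Facade: a real one would delegate to collaborators and keep partial results. */
class SorterFacadeSketch {
    final List<String> trace = new ArrayList<>();

    public void sortPartitions() { trace.add("sortPartitions"); }
    public void getPivots() { trace.add("getPivots"); }
    public void partitionData() { trace.add("partitionData"); }
    public void mergeData() { trace.add("mergeData"); }
}
```

Nesting a second sequence inside one of the Facade methods, for example inside mergeData(), would give the two-level arrangement used for the last phase of PSRS.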


Distributor Now we address the parallelism in the 
first, third, and last phases. Each of these phases requires
a fixed amount of concurrency (p processors). If
we attempt to vary the number of processors for differ- 
ent phases, we will generate different data distributions 
that will cause problems for the operations. Further, the 
processors operate on the same region of data in the first 
and third phases. If we can distribute the data once to a 
fixed set of processors, we can avoid redistribution costs 
and preserve locality. The last phase requires a redistri- 
bution of data, but again it must use the same number of 
processors as used in the previous parallel phases. Sim- 
ilarly, the two subphases for the last phase share a com- 
mon region, the temporary buffer. It is also necessary for 
the concurrency to be finished at the end of each phase 
because of the dependencies between the phases. 


Given these requirements, we apply the Distributor pat- 
tern. This pattern provides a parent object that internally 
uses a fixed number of child objects over which data 
may be distributed. In the PSRS example, the number 
of children corresponds to the number of processors p. 
All method invocations are targeted on the parent object, 
which controls the parallelism. In this pattern, the user 
specifies a set of methods that can be invoked on all of 
the children in parallel. The parent creates the concur- 
rency for these methods and controls it, waiting for the 
threads to finish and returning the results (an array of 
results, one element per child). 
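

The parent and child roles can be sketched as follows. The class DistributorSketch is hypothetical, summing a block of integers is a placeholder for a real child method, and creating one thread per child for each invocation is an assumption about how the parent supplies the concurrency.

```java
import java.util.Arrays;

/** Hypothetical parent object: owns a fixed set of children and controls the parallelism. */
class DistributorSketch {
    /** Child: works only on its own block, without reference to other children. */
    static class Child {
        private final int[] block;

        Child(int[] block) {
            this.block = block;
        }

        int sum() {
            int total = 0;
            for (int v : block) {
                total += v;
            }
            return total;
        }
    }

    private final Child[] children;

    DistributorSketch(int[] data, int numChildren) {
        children = new Child[numChildren];
        for (int i = 0; i < numChildren; i++) {
            int from = i * data.length / numChildren;
            int to = (i + 1) * data.length / numChildren;
            children[i] = new Child(Arrays.copyOfRange(data, from, to));
        }
    }

    /** Parent-side method: runs sum() on every child concurrently, returns one result per child. */
    int[] sum() {
        int[] results = new int[children.length];
        Thread[] threads = new Thread[children.length];
        for (int i = 0; i < children.length; i++) {
            final int index = i;
            threads[i] = new Thread(() -> results[index] = children[index].sum());
            threads[i].start();
        }
        // The call returns only after all children have finished.
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for children", e);
            }
        }
        return results;
    }
}
```

The caller sees only the parent: it invokes sum() once and receives the array of per-child results after the concurrency has finished.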


The Distributor pattern can also be used in other pro- 
grams. It was used three times in the PSRS algorithm, 
and can be applied to any data-parallel problem.


After the design stage, the user selects the Distributor 
pattern template and instantiates the template. For the 
Distributor pattern template, the user specifies the names 
of the concrete classes for the parent and child classes 
(again, the abstract classes are automatically named) and 
a list with the following fields: 


1. The name of the method that should be invoked 
concurrently on the child objects. 


2. The return type for the child implementation of the 
method. The parent returns an array of this type, 
unless the type is void. 


3. The arguments to the parent implementation of the 
method. For one-dimensional array parameters, 
the distribution of the parameter over the child ob- 
jects must also be specified. Currently supported 
distributions are pass through, one element per 
child, striped distribution, block distribution, and 
neighbour distribution (child i gets a two-element
array of elements i and i + 1 from the original array).
All other arguments are passed to the children
directly (pass through distribution).
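

The supported distributions of a one-dimensional array can be sketched as functions that return the part of the array seen by a given child. The class ArrayDistributionSketch and its method names are hypothetical, int stands in for an arbitrary element type, and the rounding used for uneven block sizes is an assumption.

```java
import java.util.Arrays;

/** Hypothetical helpers: what child number `child` receives under each distribution. */
class ArrayDistributionSketch {
    /** Pass through: every child receives the whole array. */
    static int[] passThrough(int[] a) {
        return a;
    }

    /** One element per child: child i receives element i. */
    static int onePerChild(int[] a, int child) {
        return a[child];
    }

    /** Striped: child i receives elements i, i + children, i + 2 * children, ... */
    static int[] striped(int[] a, int child, int children) {
        int count = 0;
        for (int i = child; i < a.length; i += children) {
            count++;
        }
        int[] out = new int[count];
        int k = 0;
        for (int i = child; i < a.length; i += children) {
            out[k++] = a[i];
        }
        return out;
    }

    /** Block: child i receives the i-th contiguous block. */
    static int[] block(int[] a, int child, int children) {
        return Arrays.copyOfRange(a, child * a.length / children,
                (child + 1) * a.length / children);
    }

    /** Neighbour: child i receives elements i and i + 1, so the array needs children + 1 entries. */
    static int[] neighbour(int[] a, int child) {
        return new int[] {a[child], a[child + 1]};
    }
}
```

The neighbour distribution suits arrays of boundaries, where consecutive entries delimit the range each child works on.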


Note that the third field of the tuples allows
one-dimensional array arguments to be automatically and
correctly distributed to the child objects. For instance, 
we use the neighbour distribution in the first part of the 
fourth phase to distribute the partitioned elements to the 
children of the Merger Container class. We also 
distribute an array of indices, one per child, in the sec- 
ond part of the fourth phase so each child knows where it 
should merge its sorted partition. The code to distribute 
these arguments is part of the framework for the pattern 
and is not written by the user. The last parameter to the 
Distributor template, the number of children, is specified 
as a constructor argument in the generated framework. 


Using the Framework Based on the specification of 
the pattern templates for this program, the structure of 
the framework for the PSRS program is given in Fig- 
ure 8. The two uses of the Method Sequence framework 
are the Sorter Sequence and Sorter Facade 
class pair, and the Merger Sequence and Merger 
Facade pair. The two uses of the Distributor frame- 
work are the Data Container and Data Child 
pair, and the Merger Container and Merger 
Child pair. When generated, the framework does not 
contain the necessary references for composing the dif- 
ferent frameworks; creating these references is covered 
in Section 3.2.5. However, any references needed in a 
particular framework are supplied in the generated code. 
For instance, the abstract sequence class has a reference 
to the Facade object in the Method Sequence framework. 
The actual object is supplied by the user to the construc- 
tor for the sequence class. 


Figure 8: The structure of the PSRS program. The methods are executed in the order in which they are numbered.
[Diagram not reproduced. Legible call labels: 1. executeSequence(); 2. prefix(); sortPartitions(); 4a, 4b. sortPartitions(); 5. getPivots(); 6a, 6b. getPivots(); 7. partitionData(); 9. mergeData; 12. mergePartitions(); 13a, 13b. mergePartitions(); 14. mergeFinalArray(); 15a, 15b. mergeFinalArray(); 17. suffix().]


Both the Method Sequence and the Distributor frameworks
have different hook methods that can be implemented.
The sequence of method calls is shown in Figure 8.
For the sequence object in the Method Sequence
framework, these hook methods are:


prefix() This method is invoked before the methods 
in the sequence are executed. It can be used for 
any initialization required by either the sequence 
or Facade objects. For instance, in the Sorter 
Sequence class this method can be used to gen- 
erate the data to be sorted. 


suffix() This method is invoked after the methods 
in the sequence are executed. It can be used for 
any cleanup or postprocessing required by the pro- 
gram. For instance, in the Sorter Sequence 
class this method may verify the sort. 


The hook methods for the Facade object are the se- 
quence methods, the parameterless methods in the list 
of method names. These methods implement the differ- 
ent phases of the application. A phase has finished when 
its associated method returns, so all concurrent activity 
generating results used in another phase must be com- 
plete. Any partial results are stored in the Facade object. 


For the Distributor framework, the hook methods are 
the child implementations of the methods specified in 
the last pattern template parameter. Each child operates 
independently, without reference to other child objects. 
These methods can operate on any state that has been 
distributed across the child objects or can be used to in- 
voke methods on any distributed arguments in parallel. 
The parent object provides the structural code for this 
pattern, and has no hook methods. To assist the user, the 
signatures for the child methods are automatically generated
and included in the concrete child class.


3.2.4 Evaluation


Since PSRS is a parallel algorithm, there is no sequen- 
tial version. Therefore, we chose a sequential quicksort 
algorithm as a baseline for comparison. The sequential 
sorting program was 102 lines of Java code, used to ini- 
tialize the data and verify the sort. The sorting algo- 
rithm was the quicksort method from the JGL library [9], 
which is 255 lines of Java code. The PSRS program, in- 
cluding the framework and collaborator classes, totaled 
1252 lines of code (not including the JGL sorting code), 
700 of which are user code. 414 lines of the user code are 
in the three classes Data Child, Merger Child, 
and Data Container. These classes contain most 
of the code for the application (the Data Container 
object is the single processor used for the second phase). 
Of the remaining classes, the two Facade classes and 
mainline are the largest. However, the methods in these 
classes consist mainly of accessors and, particularly in 
the two Facade objects, one line methods for delegating 
a method. The mainline also interprets command line 
arguments, and creates the Sorter Sequence object 
and the container for the data to be sorted. 


In contrast to the reaction-diffusion example, the PSRS
algorithm cannot make much use of the code from the
sequential version. The problem is that the best parallel
algorithm is not necessarily a parallel version of the best
sequential algorithm. For instance, the performance of
parallel quicksort peaks at a speedup of 5 to 6 regardless
of the number of processes [13]. In these cases, writing
the parallel algorithm requires more effort, as we see
with this problem. Nevertheless, the framework supplies
about half of the code automatically, including the
error-prone concurrency and synchronization code.


The performance results for PSRS, collected using the 
same environment given in Section 3.1.3, are shown in 
Table 2. These timings are only for sorting the data. 
Data initialization and sort verification are not included. 


Unlike the reaction-diffusion example, both JIT and
non-JIT versions of the PSRS program show good
speedups, scaling to 16 processors. The principal reason
for this improvement is that there are fewer synchronization
points in PSRS: five for the entire program versus
two per iteration of the mesh loop. In addition, the
PSRS algorithm does more work between synchroniza- 
tion points, even with the smaller data set, reducing the 
overall cost of synchronization further. 


From this example, we can see that CO2P3S also sup- 
ports the development of explicitly parallel algorithms. 
The principal difficulty in implementing this kind of parallel
algorithm is that little sequential code can be used
in the parallel program, forcing the user to write more 
code (as we can see by the amount of user code needed 
for PSRS). Support for explicitly parallel algorithm de- 
velopment is crucial because a good parallel algorithm 
is not always derived from the best sequential algorithm. 


3.2.5 Composition of Frameworks 


Unlike the reaction-diffusion program, the PSRS example
used multiple design pattern templates in its implementation,
and required the resulting frameworks to be
composed into a larger program. We explain briefly 
how this composition is accomplished, which also pro- 
vides insights on how the user can augment the gener- 
ated framework at the Patterns Layer. 


In CO2P3S frameworks, composition is treated as it is in 
normal object-oriented programs, by delegating meth- 
ods to collaborating objects. Note that the framework 
implementing a design pattern is still a group of ob- 
jects providing an interface to achieve some task. For 
instance, in the code in Figure 6, the collector object 
provides an interface for the user to start the mesh com- 
putation and get the results, but the creation and control 
of the parallelism is hidden in the collector. If another 
framework has a reference to a collector object, it can 
use the Mesh framework as it would any other collab- 
orating object, providing framework composition in a 
way compatible with object-oriented programming.


To compose frameworks in this fashion, the frameworks
must be able to obtain references to other collaborating
frameworks. This can be done in three ways: passing the
references as arguments to a method (normal or hook)
and caching the reference, instantiating the collaborating
framework in the framework that requires it (in a
method or constructor), or augmenting constructors with
new arguments. The first two ways are fairly straightforward.
The second method of obtaining a reference
is used in the third phase of PSRS since the Merger
Container object cannot be created until the pivots
have been obtained from the second phase.


The third method of obtaining a reference, augmenting 
the constructor for a framework, requires more discus- 
sion as it is not always possible. We should first note 
that this option is open to users because the CO2P3S 
system requires the user to create some or all of the ob- 
jects that make up a given framework (as shown in Fig- 
ure 6). In general, users can augment the constructors of 
any objects they are responsible for creating. For the 
Mesh framework, the user can augment the construc- 
tor for the collector object. However, the added state 
can only be used to influence the parallel execution of a 
framework at the Patterns Layer if the class with the aug- 
mented constructor also has hook methods the user can 
implement. Otherwise, the user has no entry point to the 
structural code and the additional state cannot be used in 
the parallel portion of that framework. For instance, the 
user can augment the constructor for the parent object 
in a Distributor framework, but since the parent has no 
hook methods this state cannot influence the parallel be- 
haviour of that object. However, new state can always be 
used in any additional sequential methods implemented 
in the framework. 


4 Related Work 


We examine work related to the pattern, pattern tem- 
plate, and framework aspects of the CO2P3S system. 


Patterns There are too many concurrent design patterns
to list them all. Two notable sources of these patterns
are the ACE framework [11] and the concurrent design
pattern book by Lea [6]. This work provides more
patterns and attempts to provide a development system
for pattern-based parallel programming. Specifically,
our pattern templates and generated frameworks automate
the use of a set of supported patterns.


Number of     JIT        Speedup (wall clock time), in order of
Elements                 increasing number of processors

12,500,000    disabled   1.76 (1519 sec)   3.43 (777 sec)   6.74 (395 sec)   12.02 (222 sec)
12,500,000    enabled    1.73 (567 sec)    3.65 (269 sec)   7.09 (138 sec)   12.65 (78 sec)

Table 2: Speedups for the PSRS example. Wall clock times, rounded to the second, are also provided.


Pattern Templates There are many graphical parallel
programming systems, such as Enterprise [10, 14], DPnDP
[15, 14], Mentat [5], and HeNCE [2]. Enterprise
provides a fixed set of templates (called assets) for the
user, but requires the user to write code to correctly implement
the chosen template, without checking for compliance.
Further, applications are written in C, not an
object-oriented language. Mentat and HeNCE do not
use pattern templates, but rather depict programs visually
as directed graphs, compromising correctness. DPnDP
is similar to Mentat except that nodes in the graph
may contain instances of design patterns communicating
using explicit message passing. In addition, the system
provides a method for adding new templates to the tool.


P3L [1] provides a language solution to pattern-based
programming, providing a set of design patterns that can
be composed to create larger programs. Communication
is explicit and type-checked at compile time. However,
new languages impose a steep learning curve on new
users. Also, the language is not object-oriented.


Frameworks Sefika et al. [12] have proposed a model
for verifying that a program adheres to a specified design
pattern based on a combination of static and dynamic
program information. They also suggest the use of run-time
assertions to ensure compliance. In contrast, we
ensure adherence by generating the code for a pattern. 
We do not include assertions because we allow users to 
modify the generated frameworks at lower levels of ab- 
straction. These modifications can be made to increase 
performance or to introduce a variation of a pattern tem- 
plate that is not available at the Patterns Layer. 


In addition to verifying programs, Sefika et al. also sug- 
gest generating code for a pattern. Budinsky et al. [3] 
have implemented a Web-based system for generating 
code implementing the patterns from Gamma et al. [4]. 
The user downloads the code and modifies it for its intended
application. Our system generates code that allows
the user the opportunity to introduce application-specific
functionality without allowing the structure of
the framework to be modified until the performance-tuning
stage of development. This allows us to enforce
the parallel constraints of the selected pattern template.


Each of the PPSs mentioned above differs with respect
to openness. Enterprise, Mentat, HeNCE, and P3L fail
to provide low-level performance tuning. However, Enterprise
provides a complete set of development and debugging
tools in its environment. DPnDP provides performance
tuning capabilities by allowing the programmer
to use the low-level libraries used in its implementation.
Instead, we provide multiple abstractions for performance
tuning, providing the low-level libraries only
at the lowest level of abstraction.


5 Conclusions and Future Work 


This paper presented some of the parallel design patterns 
and associated frameworks supported by the CO2P3S 
parallel programming system. We demonstrated the util- 
ity of these patterns and frameworks at the first layer 
of the CO2P3S programming model by showing how 
two applications, reaction-diffusion texture generation
and parallel sorting by regular sampling, can be implemented.
Further, we have shown that our frameworks
can provide performance benefits. 


We also introduced the concept of phased design pat- 
terns to express temporal relationships in parallel pro- 
grams. These relationships may determine when to use 
parallelism and how that parallelism, if used, should be 
implemented. These phased patterns recognize that not 
every operation on an object has sufficient granularity to 
be run in parallel, and that a single parallel design pattern 
is often insufficient to efficiently parallelize an entire ap- 
plication. Instead, the parallel requirements of an appli- 
cation change as the application progresses. Phased pat- 
terns provide a mechanism for expressing this change. 


Currently, we are prototyping the CO2P3S system in
Java. We are also looking for other parallel design patterns
that can be included in the system, such as
divide-and-conquer and tree searching patterns. Once the
programming system is complete, we will investigate allowing
users to add support for their own parallel design
patterns by including new pattern templates and frameworks
in CO2P3S (as can be done in DPnDP [15]), creating
a tool set to assist with debugging and tuning programs
(such as the Enterprise environment [10]), and
conducting usability studies [14, 16].





5th USENIX Conference on Object-Oriented Technologies and Systems (COOTS '99) 


USENIX Association 


Acknowledgements 


This research was supported by grants from the Natural
Sciences and Engineering Research Council of Canada.
We are indebted to Doug Lea for numerous comments 
and suggestions that significantly improved this paper. 


References 


[1] B. Bacci, M. Danelutto, S. Orlando, S. Pelagatti,
and M. Vanneschi. P3L: A structured high level
parallel language and its structured support. Concurrency:
Practice and Experience, 7(3):225-255, 1995.


[2] A. Beguelin, J. Dongarra, A. Geist, R. Manchek,
and K. Moore. HeNCE: A heterogeneous network
computing environment. Technical Report UT-CS-93-205,
University of Tennessee, August 1993.


[3] F. Budinsky, M. Finnie, J. Vlissides, and P. Yu. Automatic
code generation from design patterns. IBM
Systems Journal, 35(2):151-171, 1996.


[4] E. Gamma, R. Helm, R. Johnson, and J. Vlissides.
Design Patterns: Elements of Reusable Object-Oriented
Software. Addison-Wesley, 1994.


[5] A. Grimshaw. Easy to use object-oriented parallel
programming with Mentat. IEEE Computer,
26(5):39-51, May 1993.


[6] D. Lea. Concurrent Programming in Java: Design
Principles and Patterns. Addison-Wesley, 1997.


[7] S. MacDonald. Parallel object-oriented pattern
catalogue. Available at
http://www.cs.ualberta.ca/~stevem/publications.html, 1998.


[8] S. MacDonald, J. Schaeffer, and D. Szafron.
Pattern-based object-oriented parallel programming.
In Proceedings of the First International
Scientific Computing in Object-Oriented Parallel
Environments Conference (ISCOPE'97), volume
1343 of Lecture Notes in Computer Science, pages
267-274. Springer-Verlag, 1997.


[9] ObjectSpace, Inc. ObjectSpace JGL: The Generic
Collection Library for Java, Version 3.0, 1997.
http://www.objectspace.com.


[10] J. Schaeffer, D. Szafron, G. Lobe, and I. Parsons.
The Enterprise model for developing distributed
applications. IEEE Parallel and Distributed Technology,
1(3):85-96, 1993.


[11] D. Schmidt. The ADAPTIVE communication environment:
Object-oriented network programming
components for developing client/server applications.
In Proceedings of the 12th Sun Users Group
Conference, 1994. A list of the individual patterns
can be found at
http://siesta.cs.wustl.edu/~schmidt/patterns-ace.html.


[12] M. Sefika, A. Sane, and R. Campbell. Monitoring
compliance of a software system with its high-level
design models. In Proceedings of the 18th
International Conference on Software Engineering
(ICSE-18), pages 387-396, 1996.


[13] H. Shi and J. Schaeffer. Parallel sorting by regular
sampling. Journal of Parallel and Distributed
Computing, 14(4):361-372, 1992.


[14] A. Singh, J. Schaeffer, and D. Szafron. Experience
with parallel programming using code templates.
Concurrency: Practice and Experience, 10(2):91-120,
1998.


[15] S. Siu, M. De Simone, D. Goswami, and A. Singh.
Design patterns for parallel programming. In Proceedings
of the 1996 International Conference on
Parallel and Distributed Processing Techniques
and Applications (PDPTA'96), pages 230-240,
1996.


[16] G. Wilson and H. Bal. Using the Cowichan problems
to assess the usability of Orca. IEEE Parallel
and Distributed Technology, 4(3):36-44, 1996.


[17] A. Witkin and M. Kass. Reaction-diffusion textures.
Computer Graphics (SIGGRAPH '91 Proceedings),
25(4):299-308, July 1991.







Intercepting and Instrumenting COM Applications 


Galen C. Hunt 
Microsoft Research 
One Microsoft Way 
Redmond, WA 98052 
galenh@microsoft.com 


Abstract 

Binary standard object models, such as Microsoft's
Component Object Model (COM), enable the development
of not just reusable components, but also an incredible
variety of useful component services through
run-time interception of binary standard interfaces.
Interception of binary components can be used for con- 
formance testing, debugging, profiling, transaction 
management, serialization and locking, cross-standard 
middleware interoperability, automatic distributed par- 
titioning, security enforcement, clustering, just-in-time 
activation, and transparent component aggregation. 

We describe the implementation of an interception 
and instrumentation system tested on over 300 COM 
binary components, 700 unique COM interfaces, 2 mil- 
lion lines of code, and on 3 major commercial-grade 
applications including Microsoft PhotoDraw 2000. 
The described system serves as the foundation for the 
Coign Automatic Distributed Partitioning System 
(ADPS), the first ADPS to automatically partition and 
distribute binary applications. 

While the techniques described in this paper were 
developed specifically for COM, they have relevance to 
other object models with binary standards, such as in- 
dividual CORBA implementations. 


1. Introduction 


Widespread adoption of Microsoft’s Component Ob- 
ject Model (COM) [16, 25] standard has produced an 
explosion in the availability of binary components, re- 
usable pieces of software in binary form. It can be ar- 
gued that this popularity is driven largely by COM’s 
binary standard for component interoperability. 

While binary compatibility is a great boon to the 
market for commercial components, it also enables a 
wide range of unique component services through in- 
terception. Because the interfaces between COM com- 
ponents are well defined by the binary standard, a 
component service can exploit the binary standard to 
intercept inter-component communication and interpose 
itself between components. 

Interception of binary components can be used for
conformance testing, debugging, distributed communication,
profiling, transaction management, serialization
and locking, cross-standard middleware interoperability,
automatic distributed partitioning, security enforcement,
clustering and replication, just-in-time
activation, and transparent component aggregation.

In this paper, we describe an interception system 
proven on over 300 COM binary components, 700 
unique COM interfaces, and 2 million lines of code [5]. 
We have extensively tested our COM interception sys- 
tem on three major commercial-grade applications: the 
MSDN Corporate Benefits Sample [12], Microsoft 
PhotoDraw 2000 [15], and the Octarine word-processor 
from the Microsoft Research COM Applications Group. 
The interception system serves as the foundation for the 
Coign Automatic Distributed Partitioning System 
(ADPS) [7] [8], the first ADPS to automatically parti- 
tion and distribute binary applications. 

In the next section, we describe the fundamental fea- 
tures of COM as they relate to the interception and in- 
strumentation of COM applications. Sections 3 and 4 
explain and evaluate our mechanisms for intercepting 
object instantiation requests and inter-object communi- 
cation respectively. We describe related work in Sec- 
tion 5. In Section 6, we present our conclusions and 
propose future work. 


2. COM Fundamentals 


COM is a standard for creating and connecting com- 
ponents. A COM component is the binary template 
from which a COM object is instantiated. Due to 
COM’s binary standard, programmers can easily build 
applications from components, even components for 
which they have no source code. COM's major features
include multiple interfaces per object, mappings
for common programming languages, standard-mandated
binary compatibility, and location-transparent
invocation.


2.1. Polymorphic Interfaces 


All first-class communication in COM takes place
through interfaces. An interface is a strongly typed
reference to a collection of semantically related functions.
An interface is identified by a 128-bit globally
unique identifier (GUID). An explicit agreement between
two components to communicate through a
named interface contains an implicit contract of the
binary representation of the interface.


Microsoft Interface Definition Language (MIDL) 


Figure 1 contains the definitions of two interfaces,
IUnknown and IStream, in the Microsoft Interface
Definition Language (MIDL). Syntactically, MIDL is
very similar to C++. To clarify the semantic features of
interfaces, MIDL attributes (enclosed in square brackets
[]) can be attached to any interface, member function,
or parameter. Attributes specify features such as the
data-flow direction of function arguments, the size of
dynamic arrays, and the scope of pointers. For example,
the [in, size_is(cb)] attribute on the pb
argument of the Write function in Figure 1 declares
that pb is an input array with cb elements.


[uuid(00000000-0000-0000-c000-000000000046)]
interface IUnknown
{
    HRESULT QueryInterface(
        [in] REFIID riid,
        [out, iid_is(riid)] void **ppObj);
    ULONG AddRef();
    ULONG Release();
};


[uuid(b3c11b80-9e7e-11d1-b6a5-006097b010e3)]
interface IStream : IUnknown
{
    HRESULT Seek(
        [in] LONG nPos);
    HRESULT Read(
        [out, size_is(cb)] BYTE *pb,
        [in] LONG cb);
    HRESULT Write(
        [in, size_is(cb)] BYTE *pb,
        [in] LONG cb);
};


Figure 1. MIDL for Two Interfaces. 


The MIDL definition of an interface describes its member 
functions and their parameters in sufficient detail to support 
location-transparent invocation. 





IUnknown 


The IUnknown interface, listed in Figure 1, is special.
All COM objects must support IUnknown. Each
COM interface must include the three member func- 
tions from IUnknown, namely: QueryInterface, 
AddRef, and Release. AddRef and Release are 
reference-counting functions for lifetime management. 
When an object’s reference count goes to zero, the ob- 
ject is responsible for freeing itself from memory. 

COM objects can support multiple interfaces. Clients
dynamically bind to a new interface by calling
QueryInterface. QueryInterface takes as
input the GUID of the interface to which the client
would like to bind and returns a pointer to the new interface.
Through run-time invocation of QueryInterface,
clients can determine the exact
functionality supported by any object.
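
The reference-counting and binding protocol described above can be illustrated with a minimal, portable sketch. It is not the Windows SDK declaration of IUnknown: the names GuidSketch, IUnknownSketch, IStreamSketch, MemoryStream, and seekThroughUnknown, the three interface identifiers, and the two result codes are all hypothetical stand-ins chosen so the example compiles without platform headers.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical 128-bit interface identifier (stand-in for a GUID).
struct GuidSketch { unsigned char bytes[16]; };

inline bool sameGuid(const GuidSketch& a, const GuidSketch& b) {
    return std::memcmp(a.bytes, b.bytes, sizeof a.bytes) == 0;
}

// Made-up identifiers and result codes, for illustration only.
const GuidSketch IID_UnknownSketch = {{0x00}};
const GuidSketch IID_StreamSketch  = {{0x01}};
const GuidSketch IID_OtherSketch   = {{0x02}};
const long S_OK_SKETCH = 0;
const long E_NOINTERFACE_SKETCH = -1;

struct IUnknownSketch {
    virtual long QueryInterface(const GuidSketch& riid, void** ppObj) = 0;
    virtual unsigned long AddRef() = 0;
    virtual unsigned long Release() = 0;
protected:
    ~IUnknownSketch() = default;   // objects are destroyed only via Release
};

struct IStreamSketch : IUnknownSketch {
    virtual long Seek(long nPos) = 0;
protected:
    ~IStreamSketch() = default;
};

class MemoryStream final : public IStreamSketch {
public:
    // liveCounter lets a caller observe when the object frees itself.
    explicit MemoryStream(int* liveCounter) : live_(liveCounter) { ++*live_; }

    long QueryInterface(const GuidSketch& riid, void** ppObj) override {
        if (sameGuid(riid, IID_UnknownSketch) || sameGuid(riid, IID_StreamSketch)) {
            *ppObj = static_cast<IStreamSketch*>(this);
            AddRef();              // each pointer handed out holds a reference
            return S_OK_SKETCH;
        }
        *ppObj = nullptr;
        return E_NOINTERFACE_SKETCH;
    }
    unsigned long AddRef() override { return ++refs_; }
    unsigned long Release() override {
        const unsigned long left = --refs_;
        if (left == 0) delete this;   // the object frees itself
        return left;
    }
    long Seek(long nPos) override { pos_ = nPos; return S_OK_SKETCH; }

private:
    ~MemoryStream() { --*live_; }
    unsigned long refs_ = 1;       // the creator holds the first reference
    long pos_ = 0;
    int* live_;
};

// A client that only knows IUnknown binds to the stream interface at run time.
long seekThroughUnknown(IUnknownSketch* unk, long pos) {
    assert(unk != nullptr);
    void* p = nullptr;
    const long hr = unk->QueryInterface(IID_StreamSketch, &p);
    if (hr != S_OK_SKETCH) return hr;
    IStreamSketch* stream = static_cast<IStreamSketch*>(p);
    const long result = stream->Seek(pos);
    stream->Release();             // drop the reference QueryInterface added
    return result;
}
```

The sketch keeps the destructor private so that the only way to destroy the object is to release the last reference, which mirrors the lifetime rule stated above.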


2.2. Common Language Mappings 


The MIDL compiler maps interface definitions into 
formats usable by common programming languages. 
Figure 2 contains the C++ abstract classes generated by 
the MIDL compiler, for the interfaces in Figure 1. 
MIDL has straightforward mappings into other com- 
piled languages such as C and Java. In addition, the 
MIDL compiler can store metadata in binary files called 
type libraries. Many development tools can import 
type libraries. Type libraries are well suited for script- 
ing languages such as the Visual Basic Scripting Edi- 
tion in Internet Explorer [11]. 


class IUnknown
{
public:
    virtual HRESULT QueryInterface(
        REFIID riid,
        void **ppObj) = 0;
    virtual ULONG AddRef() = 0;
    virtual ULONG Release() = 0;
};


class IStream : public IUnknown
{
public:
    virtual HRESULT Seek(
        LONG nPos) = 0;
    virtual HRESULT Read(
        BYTE *pb,
        LONG cb) = 0;
    virtual HRESULT Write(
        BYTE *pb,
        LONG cb) = 0;
};


Figure 2. C++ Language Mapping. 


The MIDL compiler maps a COM interface into an abstract 
C++ class. 





2.3. Binary Compatibility 


In addition to language mappings, COM specifies a 
platform-standard binary mapping for interfaces. The 
binary format for a COM interface is similar to the 
common format of a C++ virtual function table (VTBL, 
pronounced “V-Table”). All references to interfaces 
are stored as interface pointers (an indirect pointer to a 
virtual function table). Figure 3 shows the binary mapping
of the IStream interface.




Each object is responsible for allocating and releas- 
ing the memory occupied by its interfaces. Quite often, 
objects place per-instance interface data immediately 
following the interface virtual-function-table pointer. 
With the exception of the virtual function table and the 
pointer to the virtual function table, the object memory 
area is opaque to the client. 

The standardized binary mapping enforces COM’s 
language neutrality. Any language that can call a func- 
tion through a pointer can use COM objects. Any lan- 
guage that can export a function pointer can create 
COM objects. 
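
To make the layout concrete, the following sketch builds an object by hand from a function table and a pointer to it, without using C++ virtual functions. It is an illustration under assumptions rather than COM's actual headers: CounterVtbl, CounterInterface, CounterObject, createCounter, and callIncrement are hypothetical names, and the Increment method is invented for the example.

```cpp
#include <cassert>

struct CounterInterface;

// The function table: one slot per member function, in a fixed order.
struct CounterVtbl {
    unsigned long (*AddRef)(CounterInterface* self);
    unsigned long (*Release)(CounterInterface* self);
    long (*Increment)(CounterInterface* self, long by);
};

// What an interface pointer points at: a pointer to the function table.
struct CounterInterface {
    const CounterVtbl* vtbl;
};

// Object memory: the table pointer first, then per-instance data that
// the client never inspects.
struct CounterObject {
    CounterInterface iface;
    unsigned long refs;
    long value;
};

static CounterObject* fromInterface(CounterInterface* self) {
    // Valid because iface is the first member of a standard-layout struct.
    return reinterpret_cast<CounterObject*>(self);
}

static unsigned long counterAddRef(CounterInterface* self) {
    return ++fromInterface(self)->refs;
}

static unsigned long counterRelease(CounterInterface* self) {
    CounterObject* obj = fromInterface(self);
    const unsigned long left = --obj->refs;
    if (left == 0) delete obj;
    return left;
}

static long counterIncrement(CounterInterface* self, long by) {
    CounterObject* obj = fromInterface(self);
    obj->value += by;
    return obj->value;
}

static const CounterVtbl kCounterVtbl = {
    counterAddRef, counterRelease, counterIncrement
};

// "Exporting a function pointer" is all that is needed to create an object.
CounterInterface* createCounter() {
    CounterObject* obj = new CounterObject{{&kCounterVtbl}, 1, 0};
    return &obj->iface;
}

// "Calling a function through a pointer" is all that is needed to use one.
long callIncrement(CounterInterface* p, long by) {
    assert(p != nullptr && p->vtbl != nullptr);
    return p->vtbl->Increment(p, by);
}
```

The client-side code touches only the table pointer and the table itself, which is the part of the object memory that the binary mapping leaves visible.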

COM components are distributed either in application
executables (.EXE files) or in dynamic link libraries
(DLLs).


Figure 3. Binary Interface Mapping.


COM defines a standard binary mapping for interfaces. The 
format is similar to the common representation of a C++ pure 
abstract virtual function table. 





2.4. Location Transparency 


Binary compatibility is important because it facili- 
tates true location transparency. A client can commu- 
nicate with a COM object in the same process (in- 
process), in a different process (cross-process), or on an 
entirely different machine (cross-machine). The loca- 
tion of the COM object is completely transparent to 
both client and component because in each case invoca- 
tion takes place through an interface’s virtual function 
table. 


Interface Proxies and Stubs 


Location transparency is achieved through proxies 
and stubs generated by the MIDL compiler. Proxies 
marshal function arguments into a single message that 
can be transported between address spaces or between 
machines. Stubs unmarshal messages into function 
calls. Interface proxies and stubs copy data structures 
with deep-copy semantics. In theory, proxies and stubs 
come in pairs—the first for marshaling and the second 
for unmarshaling. In practice, COM generally com- 
bines code for the proxy and stub for a specific inter- 
face into a single reusable binary. COM proxies and 


stubs are similar in purpose to CORBA [19, 23] stubs 
and skeletons. However, their implementations vary 
because COM proxies and stubs are only used when 
inter-object communication crosses process boundaries. 


In-Process Communication 


For best performance, components reside in the cli- 
ent’s address space. An application invokes an in- 
process object directly through the interface virtual 
function table. In-process communication has the same 
cost as a C++ virtual function call because it uses nei- 
ther interface proxies nor stubs. The primary drawback 
of in-process objects is that they share the same protec- 
tion domain as the application. The application cannot 
protect itself from erroneous or malicious resource ac- 
cess by the object. 


Cross-Process Communication 


To provide the application with security, objects can 
be located in another operating-system process. The 
application communicates with cross-process objects 
through interface proxies and stubs. The application 
invokes the object through an indirect call on an inter- 
face virtual function table. In this case, however, the 
virtual function table belongs to the interface proxy. 
The proxy marshals function arguments into a buffer 
and transfers execution to the object’s address space 
where the interface stub unmarshals the arguments and 
calls the object through the interface virtual function 
table in the target address space. Marshaling and un- 
marshaling are completely transparent to both applica- 
tion and component. 
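
The call path just described can be sketched in a single process. This is a simplified illustration, not MIDL-generated code: IWriter, BufferWriter, WriterStub, WriterProxy, and writeThrough are hypothetical names, the message format (a 32-bit length followed by the bytes) is an assumption, and a direct function call stands in for the transfer between address spaces.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical interface with one method, modeled on IStream::Write.
struct IWriter {
    virtual long Write(const unsigned char* pb, long cb) = 0;
    virtual ~IWriter() = default;
};

// The real object, which would live in the other address space.
class BufferWriter : public IWriter {
public:
    long Write(const unsigned char* pb, long cb) override {
        data.insert(data.end(), pb, pb + cb);
        return cb;
    }
    std::vector<unsigned char> data;
};

using Message = std::vector<unsigned char>;

// Stub: unmarshals the message and calls the object through its interface.
class WriterStub {
public:
    explicit WriterStub(IWriter& target) : target_(target) {}
    long dispatch(const Message& msg) {
        assert(msg.size() >= sizeof(std::int32_t));
        std::int32_t cb = 0;
        std::memcpy(&cb, msg.data(), sizeof cb);
        assert(cb >= 0 && msg.size() == sizeof cb + static_cast<std::size_t>(cb));
        return target_.Write(msg.data() + sizeof cb, cb);
    }
private:
    IWriter& target_;
};

// Proxy: exposes the same interface as the object and marshals the
// arguments into one message. The array is deep-copied, so the object
// never sees the caller's memory.
class WriterProxy : public IWriter {
public:
    explicit WriterProxy(WriterStub& stub) : stub_(stub) {}
    long Write(const unsigned char* pb, long cb) override {
        const std::int32_t n = static_cast<std::int32_t>(cb);
        Message msg(sizeof n + static_cast<std::size_t>(n));
        std::memcpy(msg.data(), &n, sizeof n);
        if (n > 0) {
            std::memcpy(msg.data() + sizeof n, pb, static_cast<std::size_t>(n));
        }
        return stub_.dispatch(msg);   // stands in for the cross-process hop
    }
private:
    WriterStub& stub_;
};

// Client code is identical whether w is the object or its proxy.
long writeThrough(IWriter& w, const char* text) {
    return w.Write(reinterpret_cast<const unsigned char*>(text),
                   static_cast<long>(std::strlen(text)));
}
```

Because writeThrough accepts any IWriter, the same client code works against the object directly and against the proxy, which is the transparency property the text describes.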


Cross-Machine Communication 


Invocation of distributed objects is very similar to 
invocation of cross-process objects. Cross-machine 
communication uses the same interface proxies and 
stubs as cross-process communication. The primary 
difference is that once the function arguments have 
been marshaled, COM sends the serialized message to 
the destination machine using the DCOM protocol [3], 
a superset of the Open Group’s Distributed Computing 
Environment Remote Procedure Call (DCE RPC) pro- 
tocol [4]. 


3. Interception of Object Instantiations 


COM objects are dynamic objects. Instantiated dur- 
ing an application’s execution, objects communicate 
with the application and each other through dynami- 
cally bound interfaces. An object frees itself from 
memory after all references to it have been released by 
the application and other objects. 

Applications instantiate COM objects by calling API
functions exported from a user-mode COM DLL. Applications
bind to the COM DLL either statically or
dynamically.

Static binding to a DLL is very similar to the use of 
shared libraries in most UNIX systems. Static binding 
is performed in two stages. At link time, the linker em- 
beds in the application binary the name of the DLL, a 
list of all imported functions, and an indirect jump table 
with one entry per imported function. At load time, the 
loader maps all imported DLLs into the application’s 
address space and patches the indirect jump table en- 
tries to point to the correct entry points in the DLL im- 
age. 

Dynamic binding occurs entirely at run time. A 
DLL is loaded into the application’s address space by 
calling the LoadLibrary Win32 function. After 
loading, the application looks for procedures within the 
DLL using the GetProcAddress function. In contrast
to static binding, in which all calls use an indirect
jump table, GetProcAddress returns a direct pointer
to the entry point of the named function.


BindMoniker                 OleCreateDefaultHandler
CoCreateInstance            OleCreateEx
CoCreateInstanceEx          OleCreateFontIndirect
CoGetClassObject            OleCreateFromData*
CoGetInstanceFromFile       OleCreateFromFile*
CoRegisterClassObject       OleCreateLink*
CreateAntiMoniker           OleCreateStaticFromData
CreateBindCtx               OleGetClipboard
CreateClassMoniker          OleLoad
CreateDataAdviseHolder      OleLoadFromStream
CreateFileMoniker           OleLoadPicture
CreateGenericComposite      OleLoadPictureFile
CreateItemMoniker           OleRegEnumFormatEtc
CreateOleAdviseHolder       OleRegEnumVerbs
CreatePointerMoniker        StgCreateDocfile
GetRunningObjectTable       StgCreateDocfileOn*
MkParseDisplayName          StgGetIFillLockBytesOn*
MonikerCommonPrefixWith     StgOpenAsyncDocfileOn*
MonikerRelativePathTo       StgOpenStorage
OleCreate                   StgOpenStorageOn*


Figure 4. Object Instantiation Functions. 

COM supports approximately 50 functions capable of
instantiating a new object. However, most instantiation
requests use either CoCreateInstance or
CoCreateInstanceEx.


The COM DLL exports approximately 50 functions
capable of instantiating new objects; these are listed in
Figure 4. With few exceptions, applications instantiate
objects exclusively through the CoCreateInstance
function or its successor, CoCreateInstanceEx.
From the instrumentation perspective there is little difference
among the COM API functions. For brevity,
we use CoCreate as a placeholder for any function
that instantiates new COM objects.


3.1. Alternatives for Instantiation Interception


To intercept all object instantiations, instrumentation 
should be called at the entry and exit of each object 
instantiation function. 

Figure 5 enumerates the techniques available for in- 
tercepting functions; namely: source-code call replace- 
ment, binary call replacement, DLL redirection, DLL 
replacement, breakpoint trapping, and inline redirec- 
tion. 


[Figure 5 diagram: application and COM DLL code fragments illustrating each interception technique.]



Figure 5. Intercepting Instantiation Calls. 


Object instantiation calls can be intercepted by 1) call re- 
placement in the application source code; 2) call replacement 
in the application binary; 3) DLL redirection; 4) DLL re- 
placement; 5) trapping in the COM DLL; and 6) inline redi- 
rection in the COM DLL. 





Call replacement in application source code. 


Calls to the COM instantiation functions can be re- 
placed with calls to the instrumentation by modifying 
application source code. The major drawback of this 
technique is that it requires access to application source 
code. 


Call replacement in application binary code. 


Calls to the COM instantiation functions can be replaced
with calls to the instrumentation by modifying
application binaries. While this technique does not
require source code, replacement in the application binary
does require the ability to identify all applicable
call sites. To facilitate identification of all call sites, the
application must be linked with substantial symbolic
information.


DLL redirection. 


The import entries for COM APIs in the application 
can be modified to point to another library. Redirection 
to another DLL can be achieved either by replacing the 
name of the COM DLL in the import table before load 
time or by replacing the function addresses in the indi- 
rect jump table after load. Unfortunately, redirecting to 
another DLL through either of the import tables fails to 
intercept dynamic calls using LoadLibrary and 
GetProcAddress. 
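
The limitation of DLL redirection can be seen in a small simulation. Nothing here touches a real import table: FakeDll, FakeApp, comCreate, instrumentedCreate, and redirectImport are hypothetical names, and an in-memory map stands in for both the DLL's export list and the application's indirect jump table.

```cpp
#include <cassert>
#include <map>
#include <string>

using CreateFn = int (*)(int clsid);

// Stand-in for an instantiation function inside the COM DLL.
int comCreate(int clsid) { return clsid * 10; }

// Simulated DLL: maps exported names to entry points.
struct FakeDll {
    std::map<std::string, CreateFn> exports;
    // Plays the role of GetProcAddress: returns a direct pointer.
    CreateFn getProcAddress(const std::string& name) const {
        auto it = exports.find(name);
        return it == exports.end() ? nullptr : it->second;
    }
};

// Simulated application image with an indirect jump table.
struct FakeApp {
    std::map<std::string, CreateFn> jumpTable;
    // Plays the role of the loader patching one jump-table entry.
    void bind(const FakeDll& dll, const std::string& name) {
        jumpTable[name] = dll.getProcAddress(name);
    }
    // A statically bound call always goes through the jump table.
    int callStatic(const std::string& name, int clsid) const {
        auto it = jumpTable.find(name);
        assert(it != jumpTable.end() && it->second != nullptr);
        return it->second(clsid);
    }
};

// Instrumentation state (globals because the hook is a plain function).
static int interceptedCalls = 0;
static CreateFn realCreate = nullptr;

int instrumentedCreate(int clsid) {
    ++interceptedCalls;            // entry instrumentation
    return realCreate(clsid);      // forward to the original function
}

// DLL redirection after load: overwrite the jump-table entry.
void redirectImport(FakeApp& app, const std::string& name, CreateFn replacement) {
    realCreate = app.jumpTable.at(name);
    app.jumpTable[name] = replacement;
}
```

In this model, a call made through callStatic after redirectImport reaches instrumentedCreate, while a pointer obtained from getProcAddress still leads straight to comCreate and is never counted.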


DLL replacement. 


The only way to guarantee interception of a specific 
DLL function is to insert the interception mechanism 
into the function code. The most obvious method is to 
replace the COM DLL with a new version containing 
instrumentation. DLL replacement requires source ac- 
cess to the COM DLL library. It also unnecessarily 
penalizes all applications using the COM DLL, whether 
they use the additional functionality or not. 


Breakpoint trapping of the COM DLL. 


Rather than replace the DLL, the interception 
mechanism can be inserted into the image of the COM 
DLL after it has been loaded into the application ad- 
dress space. At run time, the instrumentation system 
can insert a breakpoint trap at the start of each target 
instantiation function. When execution reaches the 
function entry point, a debugging exception is thrown 
by the trap and caught by the instrumentation system. 
The major drawback to breakpoint trapping is that de- 
bugging exceptions suspend all application threads. In 
addition, the debug exception must be caught in a sec- 
ond operating-system process. Interception via break- 
point trapping has a high performance penalty. 


Inline redirection of the COM DLL. 


The most favorable method for intercepting DLL 
functions is to inline the redirection call. At load time, 
the first few instructions of the target instantiation func- 
tion are replaced with a jump instruction to a detour 
function in the instrumentation. Replacing the first few 
instructions is usually a trivial operation as these in- 
structions are normally part of the function prolog gen- 
erated by a compiler and not the targets of any 
branches. The replaced instructions are used to create a 
trampoline. When the modified target function is in- 
voked, the jump instruction transfers execution to the 
detour function in the instrumentation. The detour 
function passes control to the remainder of the target 
function by invoking the trampoline. 





3.2. Evaluation of Instantiation Interception 


Our instrumentation system uses inline redirection to
intercept object instantiation calls. At load time, our
instrumentation replaces the first few instructions of the
target function with a jump to the instrumentation detour
function. Pages for code sections are mapped into
a process's address space using copy-on-write semantics.
Calls to VirtualProtect and FlushInstructionCache
enable modification of code
pages at run time. Instructions removed from the target
function are placed in a statically allocated trampoline
routine. As shown in Figure 6, the trampoline allows
the detour function to invoke the target function without
interception.


Before redirection                After redirection

COM DLL Binary                    COM DLL Binary
  COM_CoCreate:                     COM_CoCreate:
    push ebp                          jmp  _Coign_CoCreate
    mov  ebp,esp                    COM_CoCreate+5:
    push ebx                          push edi
    push esi
    push edi

Trampoline                        Trampoline
  Trp_CoCreate:                     Trp_CoCreate:
    jmp  _COM_CoCreate                push ebp
                                      mov  ebp,esp
                                      push ebx
                                      push esi
                                      jmp  _COM_CoCreate+5


Figure 6. Inline Redirection. 


The first few instructions of the target API function are 
moved to the trampoline and replaced with a jump to the in- 
terception system. The trampoline effectively invokes the 
API function without interception. On the Intel x86 architec- 
ture, a jump instruction occupies five bytes. 
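
The five-byte figure follows from the encoding of the x86 near jump: one opcode byte (0xE9) followed by a 32-bit displacement measured from the end of the jump instruction. The C++ sketch below illustrates only that arithmetic, on a simulated code buffer. The names encode_jmp_rel32 and write_jmp are hypothetical and are not taken from the system described here; patching real code would additionally require making the page writable and flushing the instruction cache.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Encode "jmp rel32": opcode 0xE9 plus a little-endian 32-bit displacement.
// The displacement is relative to the address just past the 5-byte jump.
std::array<std::uint8_t, 5> encode_jmp_rel32(std::uint32_t from, std::uint32_t to) {
    const std::uint32_t rel = to - (from + 5);  // unsigned wrap-around covers backward jumps
    return {{0xE9,
             static_cast<std::uint8_t>(rel & 0xFF),
             static_cast<std::uint8_t>((rel >> 8) & 0xFF),
             static_cast<std::uint8_t>((rel >> 16) & 0xFF),
             static_cast<std::uint8_t>((rel >> 24) & 0xFF)}};
}

// Overwrite the first five bytes of a simulated code buffer with the jump.
// `from` is the address the buffer is assumed to occupy in the target image.
void write_jmp(std::uint8_t* code, std::size_t len, std::uint32_t from, std::uint32_t to) {
    assert(code != nullptr && len >= 5 && "need at least five bytes to overwrite");
    const std::array<std::uint8_t, 5> jmp = encode_jmp_rel32(from, to);
    std::copy(jmp.begin(), jmp.end(), code);
}
```

In the sketch, the bytes that write_jmp overwrites would first have to be copied into the trampoline, followed by a jump back to the first untouched instruction, as Figure 6 shows.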


Although inline redirection is complicated by the
variable-length instruction set of the Intel x86
architecture, its low run-time cost and versatility more
than offset the development penalty. Inline redirection of
the CoCreateInstance function has less than a 3% overhead,
which is more than an order of magnitude smaller than the
penalty for breakpoint trapping. Table 1 lists the average
invocation time of the target function within a loop
consisting of 10,000 iterations. The invocation times
include the cost of redirection, but not any additional
instrumentation. Unlike DLL redirection, inline redirection
correctly intercepts both statically and dynamically bound
invocations. Finally, inline redirection is much more
flexible than DLL redirection or application code
modification. Inline redirection of any API function can be
selectively enabled for each process individually at load
time based on the needs of the instrumentation.

To apply inline redirection, our instrumentation sys- 
tem must be loaded into the application’s address space 
before the application executes. The current system is 
packaged as a DLL and post-linked to the application 
binary with a binary rewriter. Once loaded into the 
application address space, instrumentation is inlined 
into system DLL images. Mechanisms for inserting the 
interception system into an application’s address space 
are described fully in a paper on our Detours package [6].








Table 1. Interception Times.

Listed are the times for intercepting either an empty function
or CoCreateInstance on a 200MHz Pentium PC.

[Table body not preserved: it compares interception
techniques for two functions, Empty Function and
CoCreateInstance.]







4. Intercepting Inter-Object Calls 


The bulk of the interception system’s functionality is
devoted to identifying interfaces, understanding their 
relationships to each other, and quantifying the com- 
munication through them. This section describes how 
our system intercepts interface calls. 

Invoking an interface member function is similar to 
invoking a C++ member function. The first argument 
to any interface member function is the “this” 
pointer, the pointer to the interface. Figure 7 lists the 
C++ and C syntax to invoke an interface member func- 
tion. 


4.1. Alternatives for Invocation Interception 


There are four techniques, described below, available
to intercept member function invocations:


Replace the interface pointer.


Rather than return the object’s interface pointer, the 
interception system can return a pointer to an interface 
of its own making. When the client attempts to invoke 
an interface member function, it will invoke the instru- 
mentation, not the object. After taking appropriate 
steps, the instrumentation “forwards” the request to the 
object by directly invoking the object interface. In one 
sense, replacing the interface pointer is functionally 
similar to using remote interface proxies and stubs. For 
remote marshaling, COM replaces a remote interface
pointer with a local interface pointer to an interface
proxy.


Replace the interface virtual function table pointer.


The runtime can replace the virtual function table 
pointer in the interface with a pointer to an instrumenta- 
tion-supplied virtual function table. The instrumenta- 
tion can forward the invocation to the object by keeping 
a private copy of the original virtual function table 
pointer. 


Replace function pointers in the interface virtual
function table.


Rather than intercept the entire interface as a whole, the
interception system can replace each function pointer in
the virtual function table individually.


Intercept object code.


Finally, the instrumentation system can intercept
member-function calls at the actual entry point of the
function using inline redirection.


IStream *pIStream;

// C++ Syntax
pIStream->Seek(nPos);

// C Syntax
pIStream->pVtbl->pfSeek(pIStream, nPos);


Figure 7. Invoking an Interface Function. 

Clients invoke interface member functions through the inter- 
face pointer. The first parameter to the function (hidden in 
C++) is the “this” pointer to the interface. 
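
To make the C form concrete, the sketch below builds a tiny interface by hand and invokes it through its function table. MiniStream and its single pfSeek entry are hypothetical stand-ins, not the real IStream declaration. A real COM VTBL would begin with QueryInterface, AddRef, and Release, which are omitted here for brevity.

```cpp
#include <cassert>

// Hypothetical, simplified interface: the first field is the VTBL pointer.
struct MiniStream;

struct MiniStreamVtbl {
    // The "this" pointer is an explicit first argument, as in the C syntax.
    long (*pfSeek)(MiniStream* self, long nPos);
};

struct MiniStream {
    const MiniStreamVtbl* pVtbl;
    long position;
};

static long MiniStream_Seek(MiniStream* self, long nPos) {
    self->position = nPos;
    return self->position;
}

static const MiniStreamVtbl kMiniStreamVtbl = { &MiniStream_Seek };

MiniStream make_stream() { return MiniStream{ &kMiniStreamVtbl, 0 }; }

// C-style invocation: fetch the function pointer from the table and pass
// the interface pointer explicitly as the first argument.
long seek_c_style(MiniStream* pStream, long nPos) {
    assert(pStream != nullptr && pStream->pVtbl != nullptr);
    return pStream->pVtbl->pfSeek(pStream, nPos);
}
```

Because the call goes through two pointers that the client can see (the interface pointer and the VTBL pointer), each of them is a potential interception point.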





4.2. COM Programming Idioms 


The choice of an appropriate technique for intercept- 
ing member functions is constrained by COM’s binary 
standard for object interoperability and common COM 
programming idioms. Our interception system attempts
to deduce the identity of each called object, the
static type of the called interface, the identity of the
called member function, and the static types of all
function parameters. In addition, our interception system
degrades gracefully. Even if not all of the needed
information can be determined, the interception system
continues to function correctly.

By design, the COM binary standard restricts the
implementation of interfaces and objects to the degree
necessary to ensure interoperability. COM places four
specific restrictions on interface design to ensure object
interoperability. First, a client accesses an object
through its interface pointers. Second, the first item
pointed to by an interface pointer must be a pointer to a
virtual function table. Third, the first three entries of
the virtual function table must point to the
QueryInterface, AddRef and Release functions for
the interface. Finally, if a client intends to use an
interface, it must ensure that the interface’s reference
count has been incremented.

As long as an object programmer obeys the four 
rules of the COM binary standard, he or she is com- 
pletely free to make any other implementation choices. 
For example, the component programmer is free to 
choose any appropriate memory layout for object and 
per-instance interface data. This lack of implementation
constraint is not an accident. The original designers
of COM were convinced that no one
implementation (even of something as universal as the 
QueryInterface function) would be suitable for all 
users. Instead, they attempted to create a specification 
that enabled binary interoperability while preserving all 
other degrees of freedom. 

Specification freedom breeds implementation diver- 
sity. This diversity is manifest in the number of com- 
mon programming idioms employed by COM 
component developers. These idioms are described 
here in sufficient detail to highlight the constraints they 
place on the implementation of a COM interception and 
instrumentation system. Each of these idioms has bro- 
ken at least one other COM interception system or pre- 
liminary versions of our interception system. 


Figure 8. Simple Object Layout.


The object instance is allocated as a single memory block. 
The block contains one VTBL pointer for each supported 
interface, an instance reference count, and other object- 
specific data. All interfaces share common implementations 
of QueryInterface, AddRef, and Release. 





Simple Multiple-Interface Objects 


Most objects support no more than about a dozen
interfaces, with no duplicates. It is common practice to lay
out these simple objects in a memory block containing 
one VTBL pointer per interface, a reference count, and 
internal object variables; see Figure 8. Within the ob- 
ject’s member functions, a constant value is added to 
the “this” pointer to find the start of the memory 
block and to access object variables. All of the object 
interfaces use a common pair of AddRef and 
Release functions to maintain the object reference 
count. 
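
As an illustration of this layout, the following C++ sketch places two VTBL pointers, a reference count, and object data in one block, and recovers the block start from an interface pointer by applying a compile-time constant. All names (SimpleObject, VtblA, VtblB, from_b, and so on) are hypothetical, and the IUnknown entries are omitted to keep the sketch short.

```cpp
#include <cassert>
#include <cstddef>

struct VtblA { int (*GetA)(void* self); };
struct VtblB { int (*GetB)(void* self); };

// One memory block: a VTBL pointer per interface, a reference count, data.
struct SimpleObject {
    const VtblA* pVtblA;   // an interface-A pointer refers to this slot
    const VtblB* pVtblB;   // an interface-B pointer refers to this slot
    unsigned long refs;
    int a;
    int b;
};

// Interface A sits at offset zero, so its "this" is the block start.
static SimpleObject* from_a(void* self) {
    return static_cast<SimpleObject*>(self);
}

// Interface B: add a (negative) constant to "this" to reach the block start.
static SimpleObject* from_b(void* self) {
    assert(self != nullptr);
    return reinterpret_cast<SimpleObject*>(
        static_cast<char*>(self) - offsetof(SimpleObject, pVtblB));
}

static int GetA(void* self) { return from_a(self)->a; }
static int GetB(void* self) { return from_b(self)->b; }

static const VtblA kVtblA = { &GetA };
static const VtblB kVtblB = { &GetB };

SimpleObject make_object(int a, int b) {
    return SimpleObject{ &kVtblA, &kVtblB, 1, a, b };
}

// Interface pointers handed to clients are addresses of the VTBL slots.
void* interface_a(SimpleObject* o) { return &o->pVtblA; }
void* interface_b(SimpleObject* o) { return &o->pVtblB; }

// Client-side call through interface B.
int call_get_b(void* pB) {
    const VtblB* vtbl = *static_cast<const VtblB* const*>(pB);
    return vtbl->GetB(pB);
}
```

The constant offset only works if the pointer the member function receives is the original slot address, which is why the object must never be handed a substitute for one of its own interface pointers.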


Multiple-Instance and Tear-off Interfaces 


Sometimes, an object must support multiple copies 
of a single interface. Multiple-instance interfaces are 
often used for iteration. A new instance of the interface 
is allocated for each client. Multiple-instance interfaces 
are typically implemented using a tear-off interface. A 
tear-off interface is allocated as a separate memory 
block. The tear-off interface contains the interface’s 
VTBL pointer, an interface-specific reference count, a 
pointer to the object’s primary memory block, and any 
instance-specific data. In addition to multiple-instance 
interfaces, tear-off interfaces are often used to implement
rarely accessed interfaces when object memory size must be
minimized (i.e., when the cost of the extra four bytes for a
VTBL pointer per object instance is too high).


Universal Delegators 


Objects commonly use a technique called delegation 
to export interfaces from another object to a client. 
Delegation is often used when one object aggregates 
services from several other objects into a single entity. 
The aggregating object exports its own interfaces, 
which delegate their implementation to the aggregated 
objects. The delegating interface calls the aggregated 
interface. This implementation is interface specific, 
code intensive, and requires an extra procedure call 
during invocation. The implementation is code inten- 
sive because delegating code must be written for each 
interface type. The extra procedure call becomes par- 
ticularly important if the member function has a large 
number of arguments or multiple delegators are nested 
through layers of aggregation. 

An obvious optimization and generalization of dele- 
gation is the universal delegator. A universal delegator 
is a type-independent, re-usable delegator. The data 
structure for a universal delegator consists of a VTBL 
pointer, a reference count, a pointer to the aggregated 
interface, and a pointer to the aggregating object. Upon 
invocation, a member function in the universal delega- 
tor replaces the “this” pointer on the argument stack 
with the pointer to the delegated interface and jumps 
directly to the entry point of the appropriate member 
function in the aggregated interface. The universal
delegator is “universal” because its member functions 
need know nothing about the type of interface to which 
they are delegating; they reuse the invoking call frame. 
Implemented in a manner similar to tear-off interfaces, 
universal delegators are instantiated on demand, one per 
delegated interface with a common VTBL shared 
among all instances. 


Explicit VTBL Pointer Comparison. 


Rather than using explicit constant offsets, some 
COM components implemented in C locate the start of 
an object’s main memory block by comparing VTBL 
interface pointers. For example, the
IStream::Seek member function of the object in
Figure 8 starts with its “this” pointer pointing to
pIStreamVtbl. The object locates the start of its
memory structure by decrementing the “this” pointer
until it points to a VTBL pointer equal to the known
location of the VTBL for IUnknown. This calculation
will produce erroneous results if an interception system 
has replaced the VTBL pointer. 
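
The failure mode can be sketched in a few lines of C++. The code below is a simplified model under stated assumptions: the object's VTBL pointers are stored as an array of slots, the search is bounded by max_slots so that the sketch stays well defined, and kHookVtbl stands for an instrumentation-supplied table. None of these names come from a real component.

```cpp
#include <cassert>

struct Vtbl { int id; };  // table contents are irrelevant to the sketch

static const Vtbl kUnknownVtbl = {0};  // known location of the IUnknown VTBL
static const Vtbl kStreamVtbl  = {1};
static const Vtbl kHookVtbl    = {2};  // hypothetical replacement table

struct Obj {
    const Vtbl* slots[2];  // slots[0]: IUnknown VTBL, slots[1]: IStream VTBL
    int data;
};

Obj make_obj() { return Obj{ { &kUnknownVtbl, &kStreamVtbl }, 0 }; }

// Walk backwards from "this" until a slot equals the known IUnknown VTBL
// address. Returns nullptr if no slot matches within max_slots; a component
// without this bound would keep walking into unrelated memory.
const Vtbl* const* find_block_start(const Vtbl* const* self, int max_slots) {
    assert(self != nullptr && max_slots > 0);
    for (int i = 0; i < max_slots; ++i) {
        if (*(self - i) == &kUnknownVtbl) return self - i;
    }
    return nullptr;
}
```

With the original tables in place the search succeeds; once slots[0] is overwritten with &kHookVtbl, the comparison never matches and the object can no longer find its own memory block.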


Explicit Function Pointer Comparison. 


In a manner similar to VTBL pointer comparison, 
some components perform calculations assuming that 
function pointers in the VTBL will have known values. 
These calculations break if the interception system has 
replaced a VTBL function pointer. 


4.3. Interface Wrapping 


Our instrumentation system intercepts invocation of 
interface member functions by replacing the interface 
pointer given to the object’s client with an interface 
pointer to a specialized universal delegator, the inter- 
face wrapper. The implementation of interface wrap- 
pers was chosen after evaluating the functionality of 
possible alternatives and testing their performance 
against a suite of object-based applications. 

For brevity, we often refer to the process of creating 
an individual interface wrapper and replacing the inter- 
face pointer with a pointer to an interface wrapper as 
wrapping the interface. We also refer to interfaces as 
being wrapped or unwrapped. A wrapped interface is 
one to which clients receive a pointer to the interface 
wrapper. An unwrapped interface is one either without 
a wrapper or with the interface wrapper removed to 
yield the original object interface. 

Interface wrapping provides an easy way to identify 
an interface and a ready location to store information 
about the interface: in the per-instance interface wrap- 
per. Unlike interface wrapping, inline redirection must 
store per-instance data in an external dictionary. Ac- 
cess to the instance-data dictionary is made difficult 
because member functions are often re-used by multiple 
interfaces of dissimilar type. This is definitely the case
for universal delegation, but common even for less exotic
coding techniques. As a rule, almost all objects
reuse the same implementation of QueryInterface,
AddRef, and Release for multiple interfaces.

Interface wrapping is robust, does not break applica- 
tion code, and is extremely efficient. Finally, as we 
shall see in the next section, interface wrapping is cen- 
tral to correctly identifying the object that owns an in- 
terface. 
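
The idea of wrapping can be illustrated with a deliberately simplified C++ sketch. Unlike the interface wrapper described above, which is a type-independent universal delegator, the wrapper below is written for one hypothetical interface (Counter) so that it can be expressed in portable code. The fields owner_id and calls are assumptions standing in for whatever per-instance data an instrumentation system might keep.

```cpp
#include <cassert>

// Hypothetical one-method interface; the first field is the VTBL pointer.
struct Counter;
struct CounterVtbl { int (*Next)(Counter* self); };
struct Counter { const CounterVtbl* pVtbl; };

// The object's own implementation.
struct CounterImpl {
    Counter iface;  // first member, so a Counter* is also a CounterImpl*
    int value;
};
static int Impl_Next(Counter* self) {
    return ++reinterpret_cast<CounterImpl*>(self)->value;
}
static const CounterVtbl kImplVtbl = { &Impl_Next };

// The wrapper looks like the interface to the client and carries
// per-instance instrumentation data.
struct CounterWrapper {
    Counter iface;    // clients receive a pointer to this field
    Counter* target;  // the original object interface
    int owner_id;     // identity of the owning object
    int calls;        // number of intercepted invocations
};
static int Wrapper_Next(Counter* self) {
    CounterWrapper* w = reinterpret_cast<CounterWrapper*>(self);
    ++w->calls;                                // instrumentation step
    return w->target->pVtbl->Next(w->target);  // forward with the real "this"
}
static const CounterVtbl kWrapperVtbl = { &Wrapper_Next };

CounterImpl make_impl() { return CounterImpl{ { &kImplVtbl }, 0 }; }

CounterWrapper wrap(Counter* target, int owner_id) {
    assert(target != nullptr);
    return CounterWrapper{ { &kWrapperVtbl }, target, owner_id, 0 };
}
```

The object's own VTBL pointer and function pointers are left untouched, so idioms that compare those values continue to work; only the pointer held by the client changes.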


4.4. The Interface Ownership Problem 


In addition to intercepting interface calls, the inter- 
ception system attempts to identify which object owns 
an interface. A major breakthrough in the development 
of our interception system was the discovery of heuris- 
tics to find an interface’s owning object. 

The interface ownership problem is complicated be- 
cause to COM, to the application, and to other objects, 
an object is visible only as a loosely coupled set of in- 
terfaces. The object can be identified only through one 
of its interfaces; it has no explicit object identity. 

COM supports the concept of an object identity
through the IUnknown interface. As mentioned in
Chapter 2, every interface must inherit from and
implement the three member functions of IUnknown,
namely: QueryInterface, AddRef, and Release.
Through the QueryInterface function, a client can
query for any interface supported by the object.
Every object must support the IUnknown interface.
An object’s IUnknown interface pointer is the
object’s COM identity. The COM specification states
that a client calling QueryInterface(IID_IUnknown)
on any interface must always receive back the same
IUnknown interface pointer (the same COM identity).

Unfortunately, an object need not provide the same 
COM identity (the same IUnknown interface pointer)
to different clients. An object that exports one COM 
identity to one client and another COM identity to a 
second client is said to have a split identity. Split iden- 
tities are especially common in applications in which 
objects are composed together through a technique 
known as aggregation. In aggregation, multiple objects 
operate as a single unit by exporting a common 
QueryInterface function to all clients. Due to 
split identities, COM objects have no system-wide, 
unique identifier. 


The Obvious Solution 


A client can query an interface for its owning 
IUnknown interface (its COM identity). In the most 
obvious implementation, the interception system could 
maintain a list of known COM identities for each
object. The runtime could identify the owning object by
querying an interface for its COM identity and compar- 
ing it to a dictionary of known identities. 

In practice, calling QueryInterface to identify 
the owning object fails because QueryInterface is 
not free of side effects. QueryInterface incre- 
ments the reference count of the returned interface. 
Calling Release on the returned interface would dec- 
rement its reference count. However, the Release 
function also has side effects. Release instructs the 
object to check if its reference count has gone to zero 
and to free itself from memory in the affirmative. 
There are a few identification scenarios under which the 
object’s reference count does in fact go to zero. In the
worst-case scenario, attempting to identify an
interface’s owner would produce the unwanted side effect of
instructing the object to remove itself from memory!
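
The hazard can be modeled with a plain reference counter. The C++ sketch below is an assumption-laden illustration, not COM code: add_ref stands in for the increment performed by QueryInterface, the destroyed flag stands in for the object freeing itself, and the zero-count starting state is one hypothetical example of the identification scenarios mentioned above.

```cpp
#include <cassert>

struct RefCounted {
    unsigned long refs;
    bool destroyed;  // stands in for the object freeing itself
};

// Side effect of QueryInterface on the returned interface.
unsigned long add_ref(RefCounted& o) { return ++o.refs; }

// Release decrements the count and destroys the object when it reaches zero.
unsigned long release(RefCounted& o) {
    assert(o.refs > 0 && "release without a matching reference");
    if (--o.refs == 0) o.destroyed = true;
    return o.refs;
}

// Naive ownership probe: query for the identity, then undo the increment.
void naive_identity_probe(RefCounted& o) {
    add_ref(o);   // QueryInterface increments the count
    release(o);   // the balancing Release may drop it to zero
}
```

An object that some client already holds survives the probe, but one whose count is still zero when it is examined is destroyed by the balancing Release.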


Sources of Interface Pointers 


To find a correct solution to the interface ownership 
problem, one must understand how a client receives an 
interface pointer. It is also important to understand 
what information is available about the interface. A 
client can receive an object interface pointer in one of 
four ways: from one of the COM API object instantia- 
tion functions; by calling QueryInterface on an 
interface to which it already holds a pointer; as an out- 
put parameter from one of the member functions of an 
interface to which it already holds a pointer; or as an 
input parameter on one of its own member functions. 
Recall that our system intercepts all COM API func- 
tions for object instantiation. At the time of instantia- 
tion, the interception system wraps the interface and 
returns to the caller a pointer to the interface wrapper. 


An Analogy for the Interface Ownership Problem 


The following analogy is helpful for understanding 
the interface ownership problem. A person finds her- 
self in a large multi-dimensional building. The building 
is divided into many rooms with doors leading from 
one room to another. The person is assigned the task of 
identifying all of the rooms in the building and deter- 
mining which doors lead to which rooms. Unfortu- 
nately, all of the walls in the building are invisible. 
Additionally, from time to time new doors are added to 
the building and old doors are removed from the build- 
ing. 

Mapping the analogy to the interface ownership
problem: the building is the application, the rooms are
the objects, and the doors are the interfaces.

We describe the solution first in terms of the invisi- 
ble room analogy, then as it applies to the interface 
ownership problem. In the analogy, the solution is to 
assign each room a different color and to paint the 
doors of that room as they are discovered. The person 
starts her search in one room. She assigns the room a
color, say red. Feeling her way around the room, she
paints one side of any door she can find without leaving 
the room. The door must belong to the room because 
she didn’t pass through a door to get to it. After paint- 
ing all of the doors, she passes through one of the doors 
into a new room. She assigns the new room a color— 
say blue. She repeats the door-painting algorithm for 
all doors in the blue room. She then passes through one 
of the doors and begins the process again. The person 
repeats the process, passing from one room to another. 

If at some point the person finds that she has passed 
into a room where the door is already colored, then she 
knows the identity of the room (by the color on the 
door). She looks for any new doors in the room, paints 
them the appropriate color, and finally leaves through 
one of the doors to continue her search. 


The Solution to the Interface Ownership Problem 


From the analogy, the solution to the interface own- 
ership problem is quite simple. Each object is assigned 
a unique identifier. Each thread holds in a temporary 
variable the identity of the object in which it is cur- 
rently executing. Any newly found interfaces are in- 
strumented with an interface wrapper. The current 
object identity is recorded in the interface wrapper as 
the owning object. Finding the doors in a room is 
analogous to examining interface pointers passed as 
parameters to member functions. When execution exits 
an object, any unwrapped interface pointers passed as 
parameters are wrapped and given the identity of their 
originating object. By induction, if an interface pointer 
is not already wrapped, then it must belong to the cur- 
rent object. 

The most important invariant for solving the inter- 
face ownership problem is that at any time the intercep- 
tion system must know exactly which object is 
executing. Stored in a thread-local variable, the current 
object identifier is updated as execution crosses through 
interface wrappers. The new object identifier is pushed 
onto a local stack on entry to an interface. On exit from 
an interface wrapper (after executing the object’s code), 
the object identifier is popped from the top of the stack. 
At any time, the interception system can examine the 
top values of the identifier stack to determine the iden- 
tity of the current object and any calling objects. 
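
A minimal C++ sketch of this bookkeeping follows. It assumes a thread-local vector as the identifier stack and a global map from interface pointer to owner in place of the per-wrapper field; EnterObject, current_object, and tag_outgoing are hypothetical names, and the map is not synchronized, so the sketch is single-threaded apart from the stack itself.

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

using ObjectId = int;

// Per-thread stack of object identifiers; the top is the executing object.
thread_local std::vector<ObjectId> g_identity_stack;

// Stand-in for the owner recorded in each interface wrapper.
std::unordered_map<const void*, ObjectId> g_owner_of;

// Pushed on entry to an interface wrapper, popped on exit from it.
class EnterObject {
public:
    explicit EnterObject(ObjectId id) { g_identity_stack.push_back(id); }
    ~EnterObject() { g_identity_stack.pop_back(); }
    EnterObject(const EnterObject&) = delete;
    EnterObject& operator=(const EnterObject&) = delete;
};

ObjectId current_object() {
    assert(!g_identity_stack.empty() && "no object is executing");
    return g_identity_stack.back();
}

// Applied to every interface pointer that leaves an object as a parameter.
ObjectId tag_outgoing(const void* iface) {
    const auto it = g_owner_of.find(iface);
    if (it != g_owner_of.end()) return it->second;  // already wrapped
    const ObjectId owner = current_object();        // new: owned by current object
    g_owner_of.emplace(iface, owner);
    return owner;
}
```

In terms of the analogy, EnterObject is passing through a door, current_object is the color of the room the person is standing in, and tag_outgoing paints a door only if it has not been painted before.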

There is one minor caveat in implementing the solu- 
tion to the interface ownership problem. While clients 
should only have access to interfaces through interface 
wrappers, an object should never see an interface wrap- 
per instead of one of its own interfaces because the ob- 
ject uses its interfaces to access instance-specific data. 
An object could receive an interface wrapper to one of 
its own interfaces if a client passes an interface pointer 
back to the owning object as an input parameter on
another call. The solution is simply to unwrap an
interface pointer whenever the pointer is passed as a
parameter to its owning object.


4.5. Acquiring Static Interface Metadata 


Interface wrapping requires static metadata about in- 
terfaces. The interface wrapper must be able to identify 
all interface pointers passed as parameters to an inter- 
face member function. There are a number of sources 
for acquiring static interface metadata. Possible sources 
include the MIDL description of an interface, COM 
type libraries, and interface proxies and stubs. 

Acquiring static interface metadata from the MIDL 
description of an interface requires static analysis tools 
to parse and extract the appropriate metadata from the 
MIDL source code. In essence, it needs the MIDL 
compiler. Ideally, interface static metadata should be 
available to the interface wrapping code in a compact 
binary form. 

Another alternative is to acquire static interface 
metadata from the COM type libraries. COM type li- 
braries allow access to COM objects from interpreters 
for scripting languages, such as JavaScript [18] or Vis- 
ual Basic [13]. While compact and readily accessible, 
type libraries describe only a subset of possible COM 
interfaces. Interfaces described in type libraries cannot 
have multiple output parameters. In addition, the meta- 
data in type libraries does not contain sufficient infor- 
mation to determine the size of all possible dynamic 
array parameters. 

Static interface metadata is also contained in the in- 
terface proxies and stubs. MIDL-generated proxies and 
stubs contain marshaling metadata encoded in strings of 
marshaling operators (called MOP strings). Static inter- 
face metadata can be acquired easily by interpreting the 
MOP strings. Unfortunately, the MOP strings are not 
publicly documented. Through an extensive process of 
trial and error involving more than 600 interfaces, at the 
University of Rochester, we were able to determine the 
meanings of all MOP codes emitted by the MIDL com- 
piler. 

Our interception system contains a MOP interpreter 
and a MOP precompiler. A heavyweight, more accu- 
rate interception subsystem uses our homegrown MOP 
interpreter. A lightweight interception subsystem uses 
the MOP precompiler to simplify the MOP strings (re- 
moving full marshaling information) before application 
execution. 

The MOP precompiler uses dead-code elimination 
and constant folding to produce an optimized metadata 
representation. The simplified metadata describes all 
interface pointers passed as interface parameters, but 
does not contain information to calculate parameter 
sizes or fully walk pointer-rich arguments. Processed 
by a secondary interpreter, the simplified metadata
allows the lightweight runtime to wrap interfaces in a
fraction of the time required with full MOP strings.
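
The effect of precompilation can be illustrated with a toy model. Because the MOP encoding is not publicly documented, the C++ sketch below does not attempt to reproduce it: ParamDesc and ParamKind are invented descriptors, precompile keeps only the positions of interface-pointer parameters, and visit_interface_args plays the role of the secondary interpreter.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical full description of one parameter (not the real MOP format).
enum class ParamKind { Scalar, Struct, SizedArray, InterfacePointer };

struct ParamDesc {
    ParamKind kind;
    std::size_t size_bytes;  // used only by a full marshaling interpreter
};

// Precompile: discard everything except the positions of the parameters
// that carry interface pointers.
std::vector<std::size_t> precompile(const std::vector<ParamDesc>& full) {
    std::vector<std::size_t> interface_params;
    for (std::size_t i = 0; i < full.size(); ++i) {
        if (full[i].kind == ParamKind::InterfacePointer) {
            interface_params.push_back(i);
        }
    }
    return interface_params;
}

// Secondary interpreter: visit only the interface-pointer arguments of a call.
std::size_t visit_interface_args(const std::vector<std::size_t>& simplified,
                                 std::size_t arg_count) {
    std::size_t visited = 0;
    for (std::size_t index : simplified) {
        assert(index < arg_count && "metadata must match the argument count");
        ++visited;  // a real system would wrap or unwrap argument `index` here
    }
    return visited;
}
```

The per-call work then depends only on the number of interface-pointer parameters rather than on the full marshaling description.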

While other COM instrumentation systems do use 
the MOP strings to acquire static interface metadata, 
ours is the first system to exploit a precompiler to
optimize parameter access.

The interception system acquires MOP strings di- 
rectly from interface proxies and stubs. However, in 
some cases, components are distributed with MIDL 
source code, but without interface proxies and stubs. In 
those cases, the programmer can easily create interface 
proxies and stubs from the IDL sources with the MIDL 
compiler. OLE ships with about 250 interfaces without 
MOP strings. We were able to create interface proxies 
and stubs with the appropriate MOP string in under one 
hour using MIDL files from the OLE distribution. 


4.6. Coping With Undocumented Interfaces 


A final difficulty in interface wrapping is coping 
with undocumented interfaces, those interfaces without 
static metadata. While all documented COM interfaces 
should have static metadata, we have found cases where 
components from the same vendor will use an undocu- 
mented interface to communicate with each other. 

When a function call on a documented interface is 
intercepted, the interface wrapper processes the incom- 
ing function parameters, creates a new stack frame, and 
calls the object interface. Upon return from the object’s 
interface, the interface wrapper processes the outgoing 
function parameters and returns execution to the client. 
Information about the number of parameters passed to 
the member function is used to create the new stack 
frame for calling the object interface. For documented 
interfaces, the size of the new stack frame can easily be 
determined from the marshaling byte codes. 

When intercepting an undocumented interface, the 
interface wrapper has no static information describing 
the size of stack frame used to call the member func- 
tion. The interface wrapper cannot create a stack frame 
to call the object. It must reuse the existing stack 
frame. In addition, the interface wrapper must intercept 
execution return from the object in order to preserve the 
interface wrapping invariants used to identify objects 
and to determine interface ownership. 

For function calls on undocumented interfaces, the 
interface wrapper replaces the return address in the 
stack frame with the address of a trampoline function. 
The original return address and a copy of the stack 
pointer are stored in thread-local temporary variables. 
The interface wrapper transfers execution to the object 
directly using a jump rather than a call instruction. 

When the object finishes execution, it issues a return
instruction. Rather than return control to the caller (as
would have happened if the interface wrapper had not
replaced the return address in the stack frame),
execution passes directly to the trampoline. As a
fortuitous benefit of COM’s callee-popped calling
convention, the trampoline can calculate the function’s stack
frame size by comparing the current stack pointer with 
the copy stored before invoking the object code. The 
trampoline saves the frame size for future calls, and 
then returns control to the client directly through a jump 
instruction to the temporarily stored return address. 

The return trampoline is used only for the first invo- 
cation of a specific member function. Subsequent calls 
to the same interface member function are forwarded 
directly through the interface wrapper. 
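
The frame-size calculation is simple pointer arithmetic. The C++ sketch below assumes a 32-bit x86 stack, a saved stack pointer that refers to the return-address slot at the moment control reaches the object, and a callee that pops both the return address and its arguments. FrameSizeCache and the integer method identifiers are hypothetical, used only to show how the measured size could be remembered for later calls.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Assumption: a 32-bit return address occupies four bytes on the stack.
constexpr std::uint32_t kReturnAddressBytes = 4;

// Bytes of arguments popped by the callee, derived from the stack pointer
// saved on entry and the stack pointer observed in the return trampoline.
std::uint32_t argument_bytes(std::uint32_t sp_at_entry,
                             std::uint32_t sp_after_return) {
    assert(sp_after_return >= sp_at_entry + kReturnAddressBytes);
    return sp_after_return - sp_at_entry - kReturnAddressBytes;
}

// Remembers the measured size so the trampoline is needed only once
// per member function.
class FrameSizeCache {
public:
    // Returns true and fills *bytes if an earlier call measured this method.
    bool lookup(int method_id, std::uint32_t* bytes) const {
        const auto it = sizes_.find(method_id);
        if (it == sizes_.end()) return false;
        *bytes = it->second;
        return true;
    }

    void learn(int method_id, std::uint32_t sp_at_entry,
               std::uint32_t sp_after_return) {
        sizes_[method_id] = argument_bytes(sp_at_entry, sp_after_return);
    }

private:
    std::unordered_map<int, std::uint32_t> sizes_;
};
```

For example, if the saved stack pointer is 0x0012FF00 and the trampoline observes 0x0012FF10, the callee popped 16 bytes in total, of which 12 were arguments (including the “this” pointer).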

By using the return trampoline, the interception sys- 
tem continues to function correctly even when con- 
fronted with undocumented interfaces. To our 
knowledge, ours is the only COM instrumentation
system to tolerate undocumented interfaces.


4.7. Evaluation of Interface Wrapping 




















Table 2. Interface Interception Times.

Listed are the times for intercepting IUnknown::AddRef
and IStream::Read (with 256 bytes of payload data) on a
200MHz Pentium PC.

[Table body not preserved: columns are IUnknown::AddRef and
IStream::Read; rows are Direct Call, Replace Interface
Pointer, Replace VTBL, Replace Function Pointer, and
Intercept Object Code.]





As detailed in Table 2, wrapping the interface by
replacing the interface pointer adds a 36% overhead to a
trivial function like IUnknown::AddRef and just a
3% overhead to a function like IStream::Read.
Processing the function arguments with interpreted
MOP strings adds on average about 20% additional
execution overhead, while processing with precompiled
MOP strings adds under 3% additional overhead.
Replacing the interface pointer is preferred over the
alternative interception mechanisms because it does not
break under common COM programming idioms.


5. Related Work 


Brown [1, 2] describes an interception system for 
COM using Universal Delegators (UDs). To use 
Brown’s UD, the application programmer is entirely 
responsible for wrapping COM interfaces. The pro- 
grammer must manually wrap each outgoing or incom- 
ing parameter with a special call to the UD code. While 
providing robust support for applications such as object 
aggregation, Brown’s UD is not suitable for binary-only 
interception and instrumentation. 

HookOLE [10] is a general interception system for
instrumenting COM applications. Like our system,
HookOLE extracts interface metadata from MIDL MOP
strings. However, rather than replacing interface
pointers, HookOLE replaces function pointers (in the VTBL)
and assumes that the same function will not be used to
implement multiple, dissimilarly typed interfaces.
HookOLE breaks whenever an object uses universal
delegation. HookOLE provides no support for
undocumented interfaces. The ITest Spy Utility [14] uses
HookOLE to provide a test harness for OLE DB
components.

Microsoft Transaction Server (MTS) [21] intercepts 
inter-component communication to enforce transaction 
boundaries and semantics. MTS wraps COM interfaces 
in a manner similar to our interception system. How- 
ever, MTS supports only a subset of possible COM 
interfaces and does not provide support for undocu- 
mented interfaces. 

COM+ [9] provides a generalized mechanism, called
interceptors, for intercepting communication between
COM+ objects. A significant redesign of COM, COM+ 
has complete control over the memory layout of all 
objects. This control significantly reduces the complex- 
ity of interception, but only works for newly designed 
COM+ components. 

COMERA [24] is an extensible remoting architec- 
ture for distributed COM communication. COMERA 
relies on existing DCOM [3] proxies and stubs to inter- 
cept cross-process communication. Neither COMERA 
nor DCOM supports in-process interception. 

Eternal [17] intercepts CORBA IIOP-related mes- 
sages via the Unix /proc mechanism. Intercepted 
messages are broadcast to objects replicated for fault 
tolerance. The /proc mechanism is limited to cross- 
process communication and extremely expensive (re- 
quiring at least two crossings of process boundaries). 

Finally, a number of CORBA [23] vendors support 
interception and filtering mechanisms. In general, in- 
strumenting COM applications is more difficult than 
equivalent CORBA applications. COM standardizes 
interface format, but not object format. Each ORB 
specifies parts of the CORBA object format related to 
interception. So for example, the interface ownership 


5th USENIX Conference on Object-Oriented Technologies and Systems (COOTS '99) 







problem has no equivalent in CORBA, but the problem 
of instrumenting binary CORBA applications independ- 
ent of ORB vendor remains unsolved. 


6. Conclusions and Future Work 


We have described a general-purpose interception 
system for instrumenting COM components and appli- 
cations. Important features of our interception system 
include inline redirection of all COM object- 
instantiation functions, interception of COM interfaces 
through interface wrappers, accurate tracking of inter- 
face ownership, and robust support for undocumented 
interfaces. 

Our interception system has been tested on over 300 
COM binary components, 700 unique COM interfaces, 
and 2 million lines of code. Using our interception sys- 
tem, the Coign ADPS has automatically partitioned and 
distributed three major applications including Microsoft 
PhotoDraw 2000. 

While our interception system is COM specific, the 
techniques described are relevant to CORBA ORBs. 
For example, inline redirection and interface wrappers 
could be used to intercept Portable Object Adapter 
(POA) [20] functions and object invocations [22]. 


Abstract 


Some form of replicated data management is a ba- 
sic service of nearly all distributed systems. Repli- 
cated data management maintains the consistency 
of replicated data. In wide-area distributed systems, 
causal consistency is often used, because it is strong 
enough to allow one to easily solve many problems 
while still keeping the cost low even with the large 
variance in latency that one finds in a wide-area 
network. Causal logging is a useful technique for 
implementing causal consistency because it greatly 
reduces the latency in reading causally consistent 
data by piggybacking updates on existing network 
traffic. 


We have implemented a CORBA service, called 
COPE, that is built on causal logging. 
COPE also shares features with some CORBA secu- 
rity services and is naturally implemented using the 
OrbixWeb interception facilities. In implementing 
COPE in OrbixWeb, we encountered several problems. 
In this paper we discuss COPE, its implementation in 
OrbixWeb, and the problems we encountered. We hope 
that this discussion will be of interest both to those 
who are implementing CORBA interception facilities 
and to those who are planning to use them. 


1 Introduction 


In nearly all distributed systems, some data values 
are replicated across different processors. For ex- 
ample, one might have a distributed cache to allow 
reads to be performed more quickly. Or, one might 
have multiple copies of a critical data structure to 
ensure that should a processor crash or become iso- 
lated by a network failure, a copy of the data will 
still remain accessible. Replicated data manage- 
ment is therefore an essential service of distributed 
systems. 


One issue arising in replicated data management is 
how consistent the replicas need to be. Considerable 
effort has gone into defining and analyzing different 
consistency models. A very strong requirement is 
for the replicas to reflect the real-time order in which 
they were written. For example, consider two val- 
ues x and y that are replicated on processors a, b, c 
and d, and let the initial values of x and y be zero. 
If processor a sets x to 1 before processor b sets y 
to 2, then c and d will both see x set to 1 before 
seeing y set to 2 (the same, of course, holds for a 
and b). This kind of consistency, called atomic con- 
sistency [7], is in general expensive to implement, 
and thus used only when absolutely required. A 
weaker form of consistency is called sequential con- 
sistency [6] in which some total order on updates 
is imposed. With the above example, c and d will 
both see the same order of updates to x and y, but 
not necessarily the update of x before the update 
of y. Sequential consistency is less expensive to im- 
plement than atomic consistency and yet is often 
strong enough for many applications. 


A still weaker consistency property is causal consis- 
tency [1]. Suppose that processor b did not update 
y until it read x and found a value of 1. In this 
case, we say that the value of y causally depends 
upon the value of x; that is, the act of a setting x 
to 1 in some sense caused b to set y to 2. Causal 
consistency ensures that the sequence of updates as 
read by other processors is consistent with causal 
dependency. In this case, both c and d would see 
the update of x before the update of y. On the other 
hand, if b did not read x before setting y, then the 
updates are said to be concurrent. Concurrent up- 
dates are not ordered, and so c could see x updated 
before y while d could see y updated before x. 


Preserving only causal dependencies among repli- 
cated data is sufficient for many applications [1]. We 
give an example of such an application later in this 
paper, namely optimistic execution in an object- 
oriented environment. And, causal consistency is 
cheaper to implement than sequential consistency 
in a wide-area setting. A problem with wide-area 
networks is the large variance in communication la- 
tency. With sequential consistency, a processor that 
updates a shared variable must ensure that its up- 
date is ordered with any other potentially concur- 
rent updates, and so the latency of an update can 
be no smaller than the longest latency from the up- 
dating processor to a copy of the data [11]. Causal 
consistency does not require concurrent updates to 
be ordered and so a processor can simply update 
its local copy and continue. There can, however, 
be latency introduced when reading a shared variable [10]. 


Latency can be reduced by implementing causal 
consistency through a technique called causal log- 
ging [3]. With causal logging, a message that up- 
dates a shared variable piggybacks the updates upon 
which the variable causally depends. That is, causal 
logging trades off bandwidth for latency. Causal log- 
ging has been used in several applications besides 
implementing causally consistent replicated data, 
including distributed simulation [5] and techniques 
for low-cost failure recovery [2]. These applications 
all share the same general property: a process does 
not observe an action (such as the delivery of a 
message or the update of a shared variable) until 
it has observed all actions that the observed action 
causally depends upon. 
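

The piggybacking idea can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not code from any of the cited systems: the names Update, Replica, write, and receive are hypothetical, update identifiers are assumed to be globally unique, and the whole log is shipped with every message rather than a pruned one.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** One write to a shared variable; the id is assumed to be globally unique. */
final class Update {
    final String id;
    final String variable;
    final int value;

    Update(String id, String variable, int value) {
        this.id = id;
        this.variable = variable;
        this.value = value;
    }
}

/** A replica that logs the updates it observes and piggybacks the log on what it sends. */
final class Replica {
    private final Map<String, Update> log = new LinkedHashMap<>();   // observed updates, in order
    private final Map<String, Integer> values = new LinkedHashMap<>();
    final List<String> observedOrder = new ArrayList<>();

    /** Local write: returns the message to ship, i.e. the causal past followed by the new update. */
    List<Update> write(String id, String variable, int value) {
        List<Update> message = new ArrayList<>(log.values()); // everything this write may depend on
        Update u = new Update(id, variable, value);
        message.add(u);
        apply(u);
        return message;
    }

    /** Delivery: apply unseen piggybacked updates first, so causes are observed before effects. */
    void receive(List<Update> message) {
        for (Update u : message) {
            if (!log.containsKey(u.id)) {
                apply(u);
            }
        }
    }

    /** Shared variables start at zero, as in the example in the text. */
    int read(String variable) {
        return values.getOrDefault(variable, 0);
    }

    private void apply(Update u) {
        log.put(u.id, u);
        values.put(u.variable, u.value);
        observedOrder.add(u.id);
    }
}
```

If a writes x, b receives that message and then writes y, the message from b carries both updates. A third replica that receives only the message from b therefore still observes the update of x before the update of y, at the cost of the extra bandwidth.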


In one of our research projects, we faced the problem 
of implementing causal logging when constructing a 
CORBA service. We call this service COPE and 
briefly describe it in Section 2. To implement this 
service in CORBA, we hoped to use an interception 
facility. Interception is a way to add functionality to 
CORBA services in a manner that is orthogonal and 
non-intrusive to the main computation. CORBA in- 
terception is implemented using interceptors which 
are code that can be invoked upon a message be- 
ing sent, or upon a message being received (as well 
as other trigger points). We describe why this fa- 
cility allows for a straightforward implementation 
of causal logging, and describe that implementa- 
tion, in Section 3. We chose OrbixWeb as 
the platform for implementing COPE because it is 
a popular Java ORB that provides interception fa- 
cilities. We detail these features in Section 4. 


COPE is a somewhat complex CORBA service, and 
to the best of our knowledge no other group has 
considered a service with similar functionality. It 
shares some features with proposed CORBA secu- 
rity services, most notably Application Access Pol- 
icy [9]. In implementing COPE, we encountered sev- 
eral problems with OrbixWeb, which we describe in 
Section 5. We believe that the problems we encoun- 
tered will also be encountered by those implement- 
ing similar CORBA services. Our goal in writing 
this paper is to discuss the problems we encountered 
in hope that those building CORBA ORBs will be 
aware of them when building interception facilities. 


2 COPE 


We have implemented a new CORBA service called 
COPE that is based on causal logging. We give a 
brief overview of COPE to ground the discussion in 
Section 3 on what we require of a CORBA causal 
logging implementation. 


One of the two abstractions that COPE implements 
is the class of assumptions. An assumption is a 
CORBA object that is eventually either asserted 
or refuted. An assumption keeps track of the ob- 
jects that wish to be notified when it is resolved. 
Assumptions can also be subclassed to associate se- 
mantics with them, such as assumptions that de- 
pend on other assumptions. An example of such an 
assumption is a proposition which is expressed as a 
boolean formula over a symbol table. Each entry in 
the symbol table is itself an assumption. A proposi- 
tion assumption becomes asserted or refuted when 
the value of its formula evaluates to true or false 
as determined by the assumptions that have been 
asserted and refuted in the symbol table. 
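

As a rough illustration of a proposition assumption, the following Java sketch evaluates a formula over a symbol table whose entries may still be unresolved. The class and method names (Proposition, symbolResolved, and) are hypothetical and are not COPE's interface; the formula is modelled as a function that returns null while its value is not yet determined.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** A proposition that resolves once its formula is determined by the symbol table. */
final class Proposition {
    private final Map<String, Boolean> symbols = new HashMap<>();   // absent = unresolved
    private final Function<Map<String, Boolean>, Boolean> formula;  // null = not yet determined
    private Boolean outcome;

    Proposition(Function<Map<String, Boolean>, Boolean> formula) {
        this.formula = formula;
    }

    /** Called when the assumption behind a symbol is asserted (true) or refuted (false). */
    void symbolResolved(String name, boolean asserted) {
        symbols.put(name, asserted);
        if (outcome == null) {
            outcome = formula.apply(symbols);
        }
    }

    /** TRUE if asserted, FALSE if refuted, null while unresolved. */
    Boolean outcome() {
        return outcome;
    }

    /** Example formula: the conjunction of two symbols under three-valued logic. */
    static Function<Map<String, Boolean>, Boolean> and(String p, String q) {
        return table -> {
            Boolean a = table.get(p);
            Boolean b = table.get(q);
            if (Boolean.FALSE.equals(a) || Boolean.FALSE.equals(b)) {
                return false;   // one refuted conjunct refutes the proposition
            }
            if (a != null && b != null) {
                return true;    // both conjuncts asserted
            }
            return null;        // still waiting on at least one symbol
        };
    }
}
```

Note that a conjunction can be refuted as soon as one of its symbols is refuted, without waiting for the other to be resolved.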


The other abstraction that COPE implements is the 
class of optimists. An optimist is a CORBA object 
that takes on assumptions. An optimist can execute 
optimistically based on the assumptions that it has 
taken on. More specifically, an optimist can either 
checkpoint its state when it takes on an assumption 
or it can block, awaiting the eventual assertion 
or refutation of the assumption. If, in either case, 
the assumption is asserted then the optimist is no- 
tified so that it can either discard the checkpoint or 
continue execution. If, on the other hand, the as- 
sumption is refuted, then the optimist is notified so 
that it can either roll its state back to the associated 
checkpoint or continue execution knowing that the 
assumption was refuted. 
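

The checkpointing variant of this behaviour can be sketched in plain Java, without any CORBA machinery. The names below (Assumption, Listener, CheckpointingOptimist, markAsserted, markRefuted) are hypothetical and are not COPE's actual interface; the optimist's state is reduced to a single integer, and only one outstanding assumption is handled.

```java
import java.util.ArrayList;
import java.util.List;

/** An assumption is eventually asserted or refuted and notifies interested objects. */
class Assumption {
    interface Listener {
        void resolved(Assumption a, boolean asserted);
    }

    private final List<Listener> listeners = new ArrayList<>();
    private Boolean outcome; // null while unresolved

    void register(Listener l) {
        listeners.add(l);
    }

    void markAsserted() {
        resolve(true);
    }

    void markRefuted() {
        resolve(false);
    }

    private void resolve(boolean asserted) {
        if (outcome != null) {
            throw new IllegalStateException("already resolved");
        }
        outcome = asserted;
        for (Listener l : listeners) {
            l.resolved(this, asserted);
        }
    }
}

/** An optimist that checkpoints its state when it takes on an assumption. */
class CheckpointingOptimist implements Assumption.Listener {
    int state;                   // stand-in for the optimist's real state
    private Integer checkpoint;  // state saved when the assumption was taken on

    void takeOn(Assumption a) {
        checkpoint = state;
        a.register(this);
    }

    @Override
    public void resolved(Assumption a, boolean asserted) {
        if (!asserted && checkpoint != null) {
            state = checkpoint;  // refuted: roll back to the checkpoint
        }
        checkpoint = null;       // either way, the checkpoint is no longer needed
    }
}
```

The blocking variant would instead suspend the optimist in takeOn until resolved is called; it is omitted here to keep the sketch short.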


Assumptions are causally consistent with respect to 
CORBA communications. For example, consider 
optimist a invoking method b.m on optimist b. If 
a has taken on an assumption x which is still unre- 
solved by the time a invokes b.m, then b must take 
on x by the time it begins execution of b.m. Sim- 
ilarly, if b.m constructs an optimist c, then c must 
also take on x by the time c completes initialization. 


Put into terms of shared memory, each optimist has 
a list of replicas of unresolved assumptions. These 
replicas are causally consistent, where “causally de- 
pends” is defined in terms both of optimists invoking 
methods on other optimists and of optimists creat- 
ing new optimists. This list is maintained as follows: 


1. When an optimist a makes a method invocation 
on an optimist b, it piggybacks on the method 
invocation a list of unresolved assumptions that 
a has taken on. 


2. When an optimist b has a method invoked by 
an object a, b checks to see if a has a class that 
derives from Optimist. If so, then b strips off 
any assumptions piggybacked on the method 
invocation and decides whether to add them 
to its own list of unresolved assumptions or to 
block the invocation. 


3. When an optimist a creates an optimist b, it 
makes a method call to an optimist factory. As 
when invoking a method on an optimist, a pig- 
gybacks on the method invocation a list of as- 
sumptions that a has taken on. The factory f 
checks to see if a has a class that derives from 
Optimist. If so, it then passes a’s unresolved as- 
sumptions to b. Optimist b can choose either to 
accept a’s assumptions, in which case the cre- 
ation is successful, or to deny them, in which 
case b is not created. 


These three rules for maintaining the list of assump- 
tions together implement causal logging of assump- 
tions. Other features of COPE, such as assertion 
resolution and object notification, are not germane 
to the discussion. Interested readers can find further 
details on COPE in [8]. 
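

The three rules can be sketched in plain Java as follows. This is a simplified model under several assumptions, not COPE's implementation: assumptions are represented by their names, the names OptimistNode, Invocation, and OptimistFactory are hypothetical, the check for a class that derives from Optimist is modelled with instanceof, and blocking an invocation is modelled as rejecting it.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Predicate;

/** A method invocation carrying the caller and its piggybacked unresolved assumptions. */
final class Invocation {
    final Object caller;
    final String method;
    final Set<String> piggybacked;

    Invocation(Object caller, String method, Set<String> piggybacked) {
        this.caller = caller;
        this.method = method;
        this.piggybacked = piggybacked;
    }
}

class OptimistNode {
    final Set<String> unresolved = new LinkedHashSet<>();
    private final Predicate<Set<String>> accepts; // hypothetical policy: take on, or block/deny

    OptimistNode(Predicate<Set<String>> accepts) {
        this.accepts = accepts;
    }

    /** Rule 1: piggyback the caller's unresolved assumptions on the invocation. */
    boolean invoke(OptimistNode target, String method) {
        return target.onInvoke(new Invocation(this, method, new LinkedHashSet<>(unresolved)));
    }

    /** Rule 2: strip the piggybacked assumptions, then take them on or block. */
    boolean onInvoke(Invocation inv) {
        if (inv.caller instanceof OptimistNode) {
            if (!accepts.test(inv.piggybacked)) {
                return false; // blocked (modelled here as a rejected invocation)
            }
            unresolved.addAll(inv.piggybacked);
        }
        return true; // the method body would run here
    }
}

final class OptimistFactory {
    /** Rule 3: the new optimist accepts the creator's assumptions, or it is not created. */
    static OptimistNode create(Object creator, Set<String> piggybacked,
                               Predicate<Set<String>> policy) {
        OptimistNode b = new OptimistNode(policy);
        if (creator instanceof OptimistNode) {
            if (!policy.test(piggybacked)) {
                return null; // creation denied
            }
            b.unresolved.addAll(piggybacked);
        }
        return b;
    }
}
```

In all three cases the receiving optimist ends up with the sender's unresolved assumptions before it does any work on the sender's behalf, which is what makes the replicated lists causally consistent.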


3 Implementing Causal Logging Us- 
ing Interception 


Consider implementing causal logging on top of 
CORBA. The following properties of an implemen- 
tation state what we believe constitutes a well- 
engineered solution. 


• Transparency. The piggybacking and strip- 
ping of piggybacked information should be im- 
plemented without explicit involvement of the 
objects using causal logging. To do otherwise 
would make it hard to ensure that the causal 
logging mechanism is correct. 


• Scheduling. Causal logging implies that in- 
formation is made available to an object at cer- 
tain points in its execution. To do otherwise 
might violate the causal consistency condition. 
Hence, the causal logging mechanism must de- 
liver causal information in an order consistent 
with the invocation of methods and the cre- 
ation of new objects. 


• Context Sensitivity. The information that 
an object a piggybacks to an object b may de- 
pend both on the state and class of a and on 
the class of b. Without knowledge about a, it 
is hard to piggyback any information because 
there is no way to obtain it, and without in- 
formation about b an object would have to pig- 
gyback all possibly useful information on every 
method invocation. The latter would both be 
inefficient and would pose a possible security 
problem. 


CORBA interception is ideally suited as a piggy- 
backing mechanism that provides the properties of 
transparency and scheduling. Interceptors are or- 
thogonal to the regular path of computation, and 
therefore provide transparency. Furthermore, since 
interception can be placed at many points in the 
method invocation sequencing, it can be used to 
provide scheduling as well. 


A simple implementation of causal logging would be 
as follows. Consider an interception mechanism in 
which every object has an interceptor that is specific 
to that object. The interceptor knows the identity 
of the object with which it is associated, and the 
interceptor is invoked upon both ends of a method 
invocation—that is, by the invoked object and by 
the invoking object. 


When a method is invoked, the interceptor on the 
invoking object uses some mechanism to determine 
what type of information is to be piggybacked. This 
mechanism can base its determination on the class 
of the invoked object. The actual information can 
be determined from the current state and the class 
of the invoking object. Hence, context sensitivity is 
implemented. The interceptor adds this data to the 
outgoing method invocation. 


When the invoked object receives the invocation, 
its interceptor removes the data that was added 
by the invoking object’s interceptor. The invoked 
object’s interceptor then implements scheduling by 
ordering the method invocations that deliver the 
causally logged information with respect to the in- 
coming method invocation. 


This scheme implements encapsulation of the 
method invocations within the causal logging mech- 
anism. That is, the underlying method invocations 
are not altered, but rather are used as a conduit 
of causally logged information and are scheduled to 
maintain causal consistency. The CORBA intercep- 
tion facility is intended for exactly this kind of en- 
capsulation. Our simple model of CORBA intercep- 
tion requires the following capabilities: 


1. Interceptors should be invoked on all outgoing 
and incoming invocations. 


2. Interceptors should be able to add information 
when a method invocation is initiated at the 
invoking side and remove information when the 
method invocation is initiated at the invoked 
side. 


3. An interceptor on the invoking side should 
know the state and class of the object with 
which it is associated and the class of the object 
being invoked. 


4. An interceptor on the invoked side should have 
the ability to make method calls on the object 
with which it is associated before it allows the 
initiating method to be invoked. 


As discussed in Section 5, we had a few difficulties 
in creating such an architecture in OrbixWeb. 
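

For concreteness, the simple interception model of this section can be sketched in plain Java. Nothing below is an OrbixWeb or CORBA API: Call, CausalReceiver, ObjectInterceptor, and RecordingReceiver are hypothetical names, and a call is reduced to a target, a method name, and a map of piggybacked extras.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** A call in flight; extras is the channel for piggybacked data. */
final class Call {
    final Object target;
    final String method;
    final Map<String, Object> extras = new HashMap<>();

    Call(Object target, String method) {
        this.target = target;
        this.method = method;
    }
}

/** Objects that take part in causal logging. */
interface CausalReceiver {
    void deliver(Object data);   // receives causally logged information
    void run(String method);     // stands in for the intercepted method itself
}

/** Per-object interceptor, invoked on both ends of an invocation (capability 1). */
final class ObjectInterceptor {
    private final CausalReceiver owner;  // capability 3: knows its associated object
    private final List<Object> pending;  // stand-in for the owner's state to piggyback

    ObjectInterceptor(CausalReceiver owner, List<Object> pending) {
        this.owner = owner;
        this.pending = pending;
    }

    /** Invoking side: add data chosen from the owner's state and the target's class (2, 3). */
    void outgoing(Call call) {
        if (call.target instanceof CausalReceiver) {   // context sensitivity
            call.extras.put("causal", new ArrayList<>(pending));
        }
    }

    /** Invoked side: remove the data and deliver it before the method itself (2, 4). */
    void incoming(Call call) {
        Object data = call.extras.remove("causal");
        if (data != null) {
            owner.deliver(data);                       // scheduling: causal data first
        }
        owner.run(call.method);
    }
}

/** A receiver that records the order of events, used to observe the scheduling. */
final class RecordingReceiver implements CausalReceiver {
    final List<String> events = new ArrayList<>();

    public void deliver(Object data) {
        events.add("deliver " + data);
    }

    public void run(String method) {
        events.add("run " + method);
    }
}
```

The objects themselves never touch the extras map, which is the transparency property; the interceptor on the invoked side delivers the causal data before the method runs, which is the scheduling property.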


4 OrbixWeb 


In COPE we implement interception by using Or- 
bixWeb filters. OrbixWeb is an implementation of 
CORBA by IONA Technologies Inc. It is compli- 
ant with the Object Management Group’s (OMG) 
CORBA specification Version 2.0 and is imple- 
mented in Java. It provides a filtering mechanism 
with support for piggybacking of additional infor- 
mation onto method invocations. There are two 
types of filters in OrbixWeb: per-process and per- 
object filters. We describe both types in turn. 


4.1 Per-Process Filters 


A per-process filter is code that is associated with 
a client or server process. The filter monitors 
all incoming and outgoing method invocations and 
attribute references that reference objects associ- 
ated with another process. More than one pro- 
cess filter can be chained together to the same pro- 
cess. There are ten points where code in a filter 
can be associated with a process: inRequestPre- 
Marshal, outRequestPreMarshal, inReplyPreMar- 
shal, outReplyPreMarshal, inRequestPostMarshal, 
outRequestPostMarshal, inReplyPostMarshal, out- 
ReplyPostMarshal, inReplyFailure, and outReply- 
Failure. Figure 1 (taken from [4]) illustrates these 
filter points. 


[Figure 1: Per-process filter monitor points. Labels 
recoverable from the diagram: Caller Process, Target 
Process, request, outRequestPreMarshal, 
outRequestPostMarshal, inRequestPreMarshal, 
outReplyPreMarshal, outReplyPostMarshal, 
inReplyPostMarshal, outReplyFailure.] 


The name of a filter method indicates where in the 
method invocation sequence it is invoked. The modifier 
request or reply indicates whether the filter is 
associated with the invocation of the method or the 
reply from the method. The modifier in or out indicates 
the direction the method invocation or method reply is 
going with respect to the process with which the filter 
is associated. In particular, out indicates the invoking 
object for method invocations and the invoked object 
for method replies, and in indicates the invoked object 
for method invocations and the invoking object for 
method replies. The stem indicates exactly where in the 
processing of the method invocation the filter is 
associated: PreMarshal is before parameter marshalling, 
PostMarshal is after parameter marshalling, and Failure 
is for exceptions. Specifically, code that is associated 
with failure filter points is executed under two 
conditions: (1) when an exception condition is raised by 
the target of the invocation or (2) when there are 
return values from any preceding filter points 
indicating that the call is not to be processed any 
further. 


The OrbixWeb abstract class 
IE.Iona.OrbixWeb.Features.Filter implements per-process 
filters. A user-defined filter is implemented by 
defining a class that inherits from the Filter class. 
When a process creates an object of this class, the 
newly-created filter is associated with the creating 
object’s process. Successive filter creation results in 
these filters being chained in the order of their 
creation. 


The following demonstrates the construction of a 
per-process filter [4]: 


 1  public class ProcessFilter extends Filter { 
 2    public boolean outReplyPreMarshal (Request r) 
      { 
 3      String s = null, o = null; 
 4      long l = 27; 
 5      try { 
 6        s = ORB.init().object_to_string( 
            (r.target())); 
 7        o = r.operation(); 
 8        OutputStream outs = 
 9          _OrbixWeb.Request(r).create_output_stream(); 
10        outs.write_long(l); 
11      } catch (SystemException se) { 
12        System.out.println("Caught exception "+se); 
        } 
13      System.out.println("Request to "+ s); 
14      System.out.println("with operation "+ o); 
15      return true; // continue the call 
      } 
    } 


Information about the invocation such as the target, 
operation name, and arguments can be accessed through 
the parameter r, which is of type 
IE.Iona.OrbixWeb.CORBA.Request. For example, the call 
r.target() in Line 6 of the code above returns the 
target of the invocation while the call r.operation() 
in Line 7 returns the name of the operation being 
invoked. 


As mentioned previously, OrbixWeb also allows extra 
information to be piggybacked onto the method 
invocation. The call 
_OrbixWeb.Request(r).create_output_stream() in Line 9 
above creates a stream to which extra information can 
be written (outs.write_long(l) in Line 10). This 
information can later be read by the corresponding 
filter point on the other side of the invocation (in 
this example, an inReply filter on the client side). 
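

The naming scheme for filter points described in this section can be made concrete with a small decoder. The class below is a hypothetical helper written for illustration; it is not part of OrbixWeb, and it only parses the ten names listed above.

```java
/** Decodes a per-process filter point name such as "outRequestPreMarshal". */
final class FilterPoint {
    final boolean outgoing;  // "out" vs "in", relative to the filter's process
    final boolean request;   // "Request" vs "Reply"
    final String stage;      // "PreMarshal", "PostMarshal", or "Failure"

    private FilterPoint(boolean outgoing, boolean request, String stage) {
        this.outgoing = outgoing;
        this.request = request;
        this.stage = stage;
    }

    static FilterPoint parse(String name) {
        boolean outgoing = name.startsWith("out");
        if (!outgoing && !name.startsWith("in")) {
            throw new IllegalArgumentException("expected in/out prefix: " + name);
        }
        String rest = name.substring(outgoing ? 3 : 2);
        boolean request = rest.startsWith("Request");
        if (!request && !rest.startsWith("Reply")) {
            throw new IllegalArgumentException("expected Request/Reply: " + name);
        }
        String stage = rest.substring(request ? "Request".length() : "Reply".length());
        boolean marshal = stage.equals("PreMarshal") || stage.equals("PostMarshal");
        boolean failure = stage.equals("Failure") && !request; // only reply failure points exist
        if (!marshal && !failure) {
            throw new IllegalArgumentException("unknown filter point: " + name);
        }
        return new FilterPoint(outgoing, request, stage);
    }

    /** True when the point runs in the invoking (caller) process. */
    boolean runsOnInvokingSide() {
        // Outgoing requests and incoming replies are both seen by the caller.
        return outgoing == request;
    }
}
```

For example, outReplyPreMarshal decodes to an outgoing reply, so it runs in the invoked (target) process, which matches the filter in the listing above.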
4.2 Per-Object Filters 


Filters can be associated with a given object as 
well. To define a per-object filter, one defines a Java 
object that implements the Java interface for the 
CORBA object that was generated by the IDL com- 
piler. This new Java object implements the desired 
filters. For example, assume the CORBA object a 
provides a method m. The IDL compiler generated 
Java interface includes the declaration of method m. 
Any per-object filter that can be associated with the 
class a must implement this interface, including the 
method m. The association is done by having the 
implementation of a create an instance of the per- 
object filter object, and then specify where in the 
method invocation path the filter is to be invoked. 
The following demonstrates the association of two 
per-object filters filter1 and filter2 with an ob- 
ject of class Foo: 


Foo foo; 
((_FooSkeleton) foo)._preObject = filter1; 
((_FooSkeleton) foo)._postObject = filter2; 


This mechanism for implementing a per-object filter 
is simple and elegant. However, as will be discussed 
in Section 5.2, it is too restrictive for our purposes. 


5 Problems 


In building COPE, we encountered difficulties in us- 
ing OrbixWeb. This section describes two of these 
difficulties: 


1. Representing the CORBA inheritance of an ob- 
ject in the underlying Java implementation. 


2. The lack of support for a generic per-object fil- 
ter. 


We also give our corresponding workarounds and 
evaluate their effectiveness. 


5.1 CORBA Inheritance Issues 


This problem arose because COPE piggybacks ref- 
erences to assumption objects, and it is common 
to derive specific kinds of assumptions from the as- 
sumption class. It is always somewhat complex fig- 
uring out how to implement a CORBA-based pro- 
gram using a specific ORB, but figuring out how to 
structure COPE was especially hard. It took ap- 
proximately a month of mail exchange with Iona 
before the problem was understood well enough to 
resolve. 


Most ORBs are implemented using an object- 
oriented language, such as Java or C++. A 
CORBA-based application is defined, in IDL, as 
a set of classes that may be related via single in- 
heritance. The IDL compiler translates these in- 
heritance relations in some manner into a set of 
class definitions in the implementation language. 
Consider a CORBA class Bar that inherits from 
a CORBA class Foo. The implementation object 
(say, BarImpl) should also inherit from the imple- 
mentation object FooImpl. In addition, with most 
IDL compilers both implementation objects are in- 
stances of a base CORBA object class. 


The problem discussed here is concerned with the 
translation chosen by the OrbixWeb IDL compiler.¹ 
Before discussing the problem further, we provide 
some background concerning how objects are im- 
plemented in OrbixWeb. Suppose we have an IDL 
interface as follows: 


interface Foo { 
    void foo_method(); 
}; 


interface Bar : Foo { 
    void bar_method(Foo f); 
}; 


The OrbixWeb IDL compiler generates eight 
files for each CORBA interface. For exam- 
ple, the CORBA interface Foo is compiled into 
FooHolder, FooHelper, _FooSkeleton, _FooStub, 
_FooImplBase, _tie_Foo, _FooOperations, and 
Foo. The first two are helper classes that support 
marshalling, narrowing, and other CORBA support 
operations. The next two are classes that are place- 
holders for the stubs. The last four, two classes and 
two interfaces respectively, are described in more 
detail below. 


¹ Recall that Java supports only single inheritance, and so 
a translation based on multiple inheritance is only possible 
through interfaces, not classes. 


[Figure 2: IDL-generated files for Foo. Labels recoverable 
from the diagram: DynamicImplementation, _FooSkeleton, 
_FooStub, FooHelper, FooHolder, _tie_Foo, _FooImplBase.] 


The complete inheritance hierarchy for Foo is de- 
picted in Figure 2. Rectangles represent interfaces. 
Shaded ovals and rectangles represent classes and 
interfaces provided in standard packages such as 
org.omg.CORBA. These packages are part of Or- 
bixWeb core classes. 


There are two approaches with which one can im- 
plement an OrbixWeb object: the “ImplBase” ap- 
proach and the “tie” approach [4]. Suppose you wish 
to implement the CORBA class Foo with the Java 
class FooImpl. In the ImplBase approach, FooImpl 
has the following signature: 


public class FooImpl extends _FooImplBase 


That is, FooImpl is a subclass of the class 
_FooImplBase. And, since _FooImplBase imple- 
ments the Java interface Foo (see Figure 2), 
FooImpl also implements Foo. Therefore, a CORBA 
Foo object can be instantiated as follows: 


Foo foo = new FooImpl(); 


The tie approach, in contrast, is a delegation 
model. With this approach the FooImpl class im- 
plements the _FooOperations interface. However, 
since _FooOperations and Foo are both interfaces, 
to instantiate a CORBA Foo object one first instan- 
tiates a FooImpl instance and then instantiates a 
_tie_Foo instance with the FooImpl instance as a 
parameter: 


Foo foo = new _tie_Foo( new FooImpl() ); 


Since _tie_Foo implements Foo, the variable foo 
has type Foo as desired. 


[Figure 3: Inheritance Diagram. Labels recoverable from 
the diagram: _tie_Foo "has a" FooImpl; BarImpl.] 


Now consider implementing the class Bar. One 
would naturally wish the implementation BarImpl 
to inherit from FooImpl. But, BarImpl also im- 
plements the methods declared in the CORBA Bar 
class. As was done in implementing Foo, we can use 
either the ImplBase approach or the tie approach. 
However, the ImplBase approach requires BarImpl 
to inherit from _BarImplBase. This implies multi- 
ple inheritance of classes, which Java does not sup- 
port. Thus, one is constrained to use the tie ap- 
proach, viz.: 


public class BarImpl extends FooImpl 
implements _BarOperations 


Since we are using the tie approach, an instance of 
Bar is created by wrapping an instance of BarImpl 
in an instance of _tie_Bar: 


Bar bar = new _tie_Bar( new BarImpl() ); 


Figure 3 illustrates the resulting inheritance dia- 
gram of a FooImpl object and a BarImpl object. 


The problem with the tie approach, however, is that 
the implementation objects (FooImpl and BarImpl 
in this example) are not instances of the Java in- 
terface that represents the CORBA object (Foo and 
Bar in this example). This poses a problem when 
using CORBA operations. 


For example, suppose that there is a CORBA Bar 
object B1 on processor 1 and a CORBA Bar object 
B2 on processor 2. B2 has a reference r to B1, and 
invokes the method r.bar_method(this). 


Further suppose that the implementation of 
bar_method in BarImpl tests the parameter f to see 
if it is an instance of Bar: 


1  public void bar_method (Foo f) { 
2    if (f instanceof Bar) 
3      ... 
4    else 
5      ... 
   } 


One might expect that the code in Line 3 would 
be executed, but it is not, because of an implemen- 
tation decision by Iona. While marshalling this 
on B2, CORBA determines the class of the value 
it is marshalling through a method on it named 
type, e.g., it invokes this.type(). If this 
were to implement the Java interface Bar, then 
this.type() would return a value indicating the 
CORBA class Bar. But, since this implements 
_BarOperations, this.type() returns the value 
null. The marshalling code therefore declares the 
parameter passed in the message to B1 to be of type 
reference to Foo. 


We dealt with this problem in the manner recom- 
mended by Iona. We save in the Java implemen- 
tation of every CORBA object a reference to the 
tie object. For example, let the member variable 
referring to the tie object of an instance of Bar be 
tieObject. The declaration of tieObject and the 
constructor in the definition of the class BarImpl 
can be as follows:² 


public class BarImpl 
    implements _BarOperations { 
  protected Bar tieObject; // declaration 

  BarImpl() { 
    tieObject = new _tie_Bar( this ); 
  } 
  ... 
} 


² Note that the setting of tieObject must be the last mem- 
ber variable initialization. If not, then the member variables 
of the tie object may be initialized to incorrect values. 


Then, B2 invokes r.bar_method(tieObject) in- 
stead of r.bar_method(this). Since tieObject 
implements Bar, tieObject.type() returns a value 
indicating the CORBA class Bar. 
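

The type relationships behind this workaround can be reproduced in plain Java. The interfaces and the tie class below are hand-written stand-ins for what an IDL compiler might generate, not actual OrbixWeb output; in particular, making _BarOperations extend _FooOperations, and the members self() and lastArgumentKind, are assumptions made for the sketch. Plain Java cannot reproduce the marshalling step, so the sketch only shows which object is an instance of Bar.

```java
/** Hand-written stand-ins for IDL-generated types; not OrbixWeb output. */
interface Foo {
    void foo_method();
}

interface Bar extends Foo {
    void bar_method(Foo f);
}

interface _FooOperations {
    void foo_method();
}

interface _BarOperations extends _FooOperations {
    void bar_method(Foo f);
}

/** The tie object implements the CORBA-facing interface and delegates to the implementation. */
final class _tie_Bar implements Bar {
    private final _BarOperations impl;

    _tie_Bar(_BarOperations impl) {
        this.impl = impl;
    }

    public void foo_method() {
        impl.foo_method();
    }

    public void bar_method(Foo f) {
        impl.bar_method(f);
    }
}

class FooImpl implements _FooOperations {
    public void foo_method() {
    }
}

class BarImpl extends FooImpl implements _BarOperations {
    protected Bar tieObject;
    String lastArgumentKind;   // records what bar_method saw, for illustration only

    BarImpl() {
        tieObject = new _tie_Bar(this);   // kept as the last member initialization
    }

    /** Reference to hand out wherever the object is passed through CORBA. */
    Bar self() {
        return tieObject;
    }

    public void bar_method(Foo f) {
        lastArgumentKind = (f instanceof Bar) ? "Bar" : "Foo";
    }
}
```

A BarImpl is not itself an instance of Bar, whereas the object returned by self() is, so a call such as b1.self().bar_method(b2.self()) takes the Bar branch of the instanceof test.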


This problem occurs in other situations. It is in 
general a good OrbixWeb design practice to have 
objects like BarImpl implement a method that re- 
turns a reference to the tie object. This reference 
should be used in all places where a reference to the 
implementation is passed using CORBA. 


This additional complexity in structure can be 
avoided by enforcing a file naming structure on the 
user’s code. For example, some IDL compilers gen- 
erate a file that the user edits to include the Java 
implementation of the CORBA object. Unlike the 
OrbixWeb IDL compiler, such a compiler knows the name of the 
implementation class when it generates the files. 
Hence, the IDL compiler can generate files that ex- 
plicitly inherit from this class as needed. In our ex- 
ample, if the IDL compiler names the implementa- 
tion class for Foo as FooObj, then the implementa- 
tion class for Bar might be generated as 


public class BarObj 
extends FooObj 
implements _BarOperations 


5.2 Per-Process Filters 


The filters that implement COPE’s causal logging 
perform the same operations for each method of an 
optimist object: they add assumptions to a method 
invocation on the invoking object’s side and remove 
the assumptions on the invoked object’s side. Ide- 
ally, one would like to be able to associate the same 
filter with each method of any class that derives 
from optimist, and this association should be done 
in a general way. Unfortunately, this cannot be done 
in OrbixWeb. As discussed in Section 4.2, an object 
that implements per-object filters is required to im- 
plement all methods defined in the IDL definition 
for the class a of objects with which it is associated. 
Classes that inherit from a require their own filters 
to be explicitly implemented. 


Since a general purpose filter cannot be constructed 
as a per-object filter and since we are unwilling to 
change the OrbixWeb IDL compiler, a per-process 
filter is our only option. A per-process filter is invoked
for all method invocations leaving and entering
the process, and so it can be used to implement
a generic filter. However, using per-process filters 
raises other problems. We have found workarounds 
for these problems, but we do not believe that the 
workarounds are acceptable in terms of meeting our 
engineering requirements. 


5.2.1 Performance 


There are performance reasons leading one to imple- 
ment several objects in the same process. It is rel- 
atively inexpensive for these objects to invoke each 
other’s methods, since such an invocation does not 
require a context switch. Furthermore, there are 
serious problems in resource utilization with run- 
ning multiple Java virtual machines on the same 
processor. Hence, one often tries to structure an 
OrbixWeb application with as many objects as pos- 
sible in the same process. 


However, per-process filters are not invoked for 
method invocations between objects in the same 
process. Hence, per-process filters cannot be used
to implement causal logging among objects in the 
same process. This implies that each COPE op- 
timist must run in its own process. This solution 
carries an enormous performance penalty. 


5.2.2 Lack of Required Information 


An OrbixWeb filter can obtain various information 
about the invocation that it is intercepting. This in- 
formation includes the reference of the object whose 
method is being invoked, the name of the method 
being invoked, the parameters of the method be- 
ing invoked, and the name of the user running the 
program that resulted in the invocation. Unfortu- 
nately, the filter cannot determine the reference of 
the object making the invocation. 


Without this information, it is impossible to provide 
the context sensitivity property defined in Section 3. 
The reason is that the piggybacked data depends 
on the state of the invoking object. Thus, the filter 
cannot obtain the information to be piggybacked 
from the invoking object. 


A per-object filter is aware of the object with which
it is associated, and so does not suffer from this
problem. However, as we saw in Section 5.2, per-object
filters cannot be used. Hence, we needed to
find a workaround.


Because of the constraint of having only one object
per process, we work around this problem by allocating
a static variable that contains a reference to
the tie object. This static variable is referenced by
the filters when they need to make an invocation on
the invoking process. This is a simple workaround,
but it is artificially simple because of the constraint
of one object per process. If the per-process filter
could be imposed for communications between objects
in the same process, then this static variable
would need to be updated before every method invocation.
Doing so would violate the transparency property of
Section 3.
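
The static-variable workaround can be illustrated with a minimal sketch. It is written in C++ purely for illustration and is not OrbixWeb code: the names TieObject, ProcessFilter, currentTie, and outgoing are hypothetical, and the piggybacked data is reduced to a list of strings.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the tie object of the single optimist in a process.
struct TieObject {
    std::string name;
    std::vector<std::string> assumptions;  // state to piggyback on outgoing calls
};

// Hypothetical per-process filter. It cannot see the caller, so it reads a
// process-wide static pointer instead.
class ProcessFilter {
public:
    // The static variable from the text: set once, because there is exactly
    // one optimist object per process.
    static TieObject* currentTie;

    // Called for every request leaving the process; returns the data that
    // would be piggybacked on the request.
    static std::vector<std::string> outgoing(const std::string& /*method*/) {
        assert(currentTie != nullptr && "tie object must be registered first");
        return currentTie->assumptions;
    }
};

TieObject* ProcessFilter::currentTie = nullptr;
```

With several objects per process, currentTie would have to be reassigned by application code before each call, which is exactly the loss of transparency described above.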


6 Discussion 


In implementing causal logging to achieve causal 
consistency for our CORBA service, COPE, we use 
the interception facilities provided by OrbixWeb fil- 
ters. OrbixWeb filters allow us to add functionality 
to legacy software in a manner that is orthogonal 
and non-intrusive to the main computation. Using 
filters, invocations in the system can be captured 
and processed before continuing with the normal 
flow of the program. However, despite our best 
endeavors, we were faced with a few difficulties. 
One problem was in understanding how to structure 
an OrbixWeb application that passes references to 
CORBA objects that can be subclassed. With the 
help of Iona, we were able to find a practical solu- 
tion. The two remaining problems were more seri- 
ous: 


1. One cannot impose a per-object filter that is 
generic—that is, that need not conform to the 
interface implemented by the object. The level 
of interception implementable using OrbixWeb 
filters therefore poses a fundamental problem. 
As we discovered in our case, getting around 
the problem incurs a very high performance 
penalty and is thus not a practical solution. 


2. One has no means to access the calling object
in a filter. This violation of the context sensitivity
property leads to the static variable solution
which, except for the problem above, would violate
the transparency property.
It is not clear why access to the calling object
was disallowed in the
first place. But with this limitation, the piggybacking
mechanism in place cannot be utilized
to its full potential.


We believe that any communications middleware 
platform for distributed computing should be pow- 
erful enough to build the engineering solution de- 
scribed in Section 3. Such a solution might even 
be considered a benchmark for the utility of such 
platforms. 


Although we have not attempted to implement
COPE using other CORBA ORBs, we have looked
into using the Legion system [12]. Legion appears
to be powerful enough to implement COPE
efficiently, but it would be interesting to actually do
so to see what problems might arise. Of course,
it would also be interesting to see how well other
CORBA ORBs could support COPE.


8 Availability


The source of COPE is available from the COPE 
home page at the following URL: 
http://www.cs.ucsd.edu/users/marzullo/COPE.html


Interoperability


Computing systems deliver their functionality at a cer- 
tain level of performance, reliability, and security. We refer 
to such non-functional aspects as quality-of-service (QoS) 
aspects. Delivering a satisfactory level of QoS is very challenging
for systems that operate in open, resource-varying
environments such as the Internet or corporate intranets. 
A system that operates in an open environment may rely 
on services that are deployed under the control of a differ- 
ent organization, and it cannot per se make assumptions 
about the QoS delivered by such services. Furthermore, 
since resources vary, a system cannot be built to operate 
with a fixed level of available resources. To deliver satis- 
factory QoS in the context of external services and varying 
resources, a system must be QoS aware so that it can com- 
municate its QoS expectations to those external services, 
monitor actual QoS based on currently available resources, 
and adapt to changes in available resources. 

A QoS-aware system knows which level of QoS it needs 
from other services and which level of QoS it can pro- 
vide. To build QoS-aware systems, we need a way to ex- 
press QoS requirements and properties, and we need a way 
to communicate such expressions. In a realistic system, 
such expressions can become rather complex. For exam- 
ple, they typically contain constraints over user-defined do- 
mains where constraint satisfaction is determined relative 
to a user-defined ordering on the domain elements. To 
cope with this complexity we are developing a specifica- 
tion language and accompanying runtime representation 
for QoS expressions. This paper introduces our language 
but focuses on the runtime representation of QoS expres- 
sions. We show how to dynamically create new expressions 
at runtime and how to use comparison of expressions as a 
foundation for building higher-level QoS components such 
as QoS-based traders. 


1. Introduction 


Enterprises increasingly rely on distributed computer 
systems for business-critical functions. They often use 
such systems for internal information sharing, handling 
of business tasks such as orders and invoices, and accounting.
In addition, businesses increasingly rely on
distributed systems for their interactions with other
businesses such as partners, customers, and subcontractors.

Since distributed systems are business critical, they 
must not only provide the right functionality, they must 
also provide the right quality-of-service (QoS) characteristics.
By QoS, we refer to non-functional properties
such as performance, reliability, quality of data, timing,
and security. For some applications, best-effort QoS is
acceptable, while others require predictable or guaranteed
levels of QoS to function properly. In real-time systems,
for example, timing is essential for correctness. In
banking systems, security is necessary and must not be 
compromised. Business-critical enterprise systems and 
telecommunications systems must be highly available. 

Ideally, we would like all systems to be up 100% of 
the time, be fully secure, and deliver exceptional perfor- 
mance. Unfortunately, building such systems is not re- 
alistic. In practice, we need to make trade-offs between 
QoS and cost of development and between different QoS 
categories. For example, achieving very high reliability 
is not only technically difficult, it is also very costly. Fur- 
thermore, providing very high reliability will also impose 
a performance overhead. QoS requirements, such as reliability,
cannot be considered in isolation; they must be
considered in the context of development cost and other 
QoS requirements, such as performance and security. 

It is also common that a technology for satisfying one 
QoS aspect will not be compatible with a technology for 
satisfying another QoS aspect. As an example, it might 
be difficult to combine a group communication mecha- 
nism for high availability with certain security mecha- 
nisms. 

To find the right solution for an enterprise computing 
system, we need to understand the cost of not satisfying 
certain QoS requirements and the cost of implementing 
and using mechanisms that provide a specific QoS level. 
To complicate things further, the cost of unsatisfactory 
QoS characteristics may vary according to the time of 
day, day of the week, and week of the year. It is also the 
case that the relative importance of specific QoS characteristics
may vary over time. During the daytime, availability
might be more important than performance due
to online sales transaction processing. During the night, 
accounting functions might need maximum performance 
to finish within a certain time. 

In summary, we believe that enterprises will become
increasingly dependent on distributed systems both for 
internal business automation and for the interaction with 
other enterprises and end customers. This will not only 
require the right functionality to be provided but also 
that the systems provide adequate quality-of-service. 
There are many issues in building systems with adequate 
QoS, and we believe that QoS must be considered sys- 
tematically throughout the life-cycle of distributed en- 
terprise systems. The goal is to support QoS-enabled 
systems, and we present a QoS fabric that is an essential 
building block for such systems. 


1.1 QoS-Enabled Systems 


A QoS-enabled system can provide defined levels of 
QoS to its users. Thus, users can customize the QoS 
delivered based on their preferences. Moreover, a QoS- 
enabled system can adapt to the environment in which 
it operates. For example, it can degrade gracefully if 
resources are scarce. There are many aspects to building 
QoS-enabled systems: 


• Mechanisms: Different QoS levels are implemented
by different mechanisms. Example mechanisms
are reliability mechanisms, such as group
communication protocols, and security mechanisms,
such as specific encryption protocols. Different
mechanisms must coexist in a QoS-enabled system,
and it must be possible to select specific mechanisms
based on the QoS level required from the system and
the QoS level provided by the environment.

• QoS Awareness: A QoS-aware system is one that
knows which levels of QoS it can deliver to its users
and which levels of QoS it requires from its environment.
Moreover, a QoS-aware system is able to
communicate those QoS specifications to other entities.

• QoS Agreements: To deliver predictable QoS levels,
it is necessary for systems to establish agreements
with other systems about delivered QoS levels.
Besides being able to describe and communicate
QoS requirements and properties, QoS agreements
also require trading and negotiation over these
requirements and properties.

• Monitoring: The ability to monitor the QoS that
is provided and received and to check compliance
with existing agreements.

• Meta Data: Information about current load, number
of deals, and available resources.


Today, most systems are not QoS enabled. Instead, 
they provide ad hoc or best-effort QoS. Sometimes spe- 
cial security, reliability, or other mechanisms are used, 
but applications are still unaware of the QoS that is required
and provided. Figure 1 illustrates the dependencies
of the different aspects of QoS enabling.

First one needs a variety of mechanisms—such as re- 
liability and encryption protocols—that enable a system 
to satisfy QoS requirements. The installation of differ- 
ent mechanisms allows a system to provide different lev- 
els of QoS and adapt to the level of QoS provided by 
the environment. However, systems cannot install arbi- 
trary mechanisms. Mechanisms can be changed within 
the constraints of the overall systems architecture. For 
example, if a system is built according to a particular 
security architecture, we may only be able to switch be- 
tween a few reliable transports and encryption technolo- 
gies that are compatible with that security architecture. 
Furthermore, adaptation is much simpler for clients than 
it is for servers. As an example, let a server and a client 
communicate over raw TCP/IP. If we want to increase 
the availability of the server, we might decide to switch 
to a group communication protocol. From the clients’ 
perspective, this can be done quite transparently. For 
the server, however, it will have a significant impact. To 
enable group communication, servers must have functionality
to join groups and transfer state, which is not
required in non-replicated systems. A system is QoS-
enabled if it can adapt in an informed manner within 
the constraints of its overall system architecture. 

We also need the ability to establish QoS agreements 
between the various systems and system entities, and we 
need the ability to monitor compliance of QoS agree- 
ments. Establishing and monitoring QoS agreements re- 
quire that we can formalize these agreements. To that 
end, we need systems to be QoS aware. We also need 
meta data to adjust agreements based on current load 
and available resources. This paper focuses on a lan- 
guage and runtime representation to make distributed 
systems QoS aware. 


1.2 QoS-Aware Systems 


QoS-enabled systems must be QoS aware so that we 
can establish QoS agreements, both internally between 
the various system components, and externally between 
a system and its environment. 


[Figure 1: diagram with boxes for QoS-enabled systems;
Mechanisms (primary/backup, SSL, group communication);
Monitoring; Agreements (trading, negotiation);
Meta data (load, deals); and QoS-aware systems
(specification language, runtime representation).]

FIG. 1. Dependencies for QoS Enabling

For example, consider a distributed currency trading
system. The front-end component of the system presents 
a user interface to human currency traders. The front- 
end uses a rate service to get rate updates and a cur- 
rency trading service to perform currency trades. In or- 
der for the front-end to deliver predictable QoS to its 
human users, it must receive predictable QoS from the 
rate service and currency trading service. For example, 
the front-end may expect the rate service to be up 99%
of the time, deliver rate updates every minute, and
provide information for a specific set of currencies. If 
the front-end is designed to work with a particular rate 
service, these expectations can be incorporated into the 
overall system design. However, if the front-end connects 
to a rate service dynamically, it needs to be explicit about
these expectations and establish a QoS agreement with 
a rate service based on these expectations. 

Figure 2 illustrates the structure of this simple system 
and how QoS information needs to flow dynamically in
the system. 

To facilitate the establishment of QoS agreements in 
distributed object systems, we need a way to specify the 
QoS requirements of clients (such as the front-end of 
the currency trading system) and the QoS properties of 
services (such as the rate service). We also need a way to 
communicate these specifications and compare them to 
determine if a particular service meets the requirements 
of a particular client. 

To communicate QoS specifications between dis- 
tributed objects, specifications must be represented as 
data structures at runtime. It is possible for each pro- 
grammer to invent his own data format and construct 
QoS specifications in an ad-hoc manner by manually 
building up the appropriate data structures. Although 
possible, this is tedious, prone to error, and does not 
support interoperability. 

Constructing QoS specifications in an ad-hoc man- 
ner is tedious and complicated because of the expres- 
sive power required and because we need to compare 
specifications to determine if one satisfies another. In 
terms of expressive power, QoS specifications essentially 
consist of constraints. However, the structure of these 
constraints is fairly complex. We need constraints over 
user-defined domains with a user-defined ordering, and 
we need constraints over statistical properties, such as 
mean, variance, and percentiles. Moreover, we need to 
bind these constraints to fine-grained entities, such as op- 
eration arguments, and to coarse-grained entities, such 
as interfaces. The required expressive power also makes 
it hard to compare specifications in an ad-hoc manner. 
For example, the comparison algorithm must operate 
on user-defined domains and take user-defined orderings 
into account. 


[Figure 2: diagram showing the Front-end and the RateService,
with a QoS Requirement and a QoS Offer flowing between them.]

FIG. 2. Structure of the currency trading system


The construction of QoS-enabled, open systems will 
require a common specification technique and format 
analogous to the IDL and IIOP standards. Rather
than have each programmer invent his own format, it 
is important to develop a common representation that 
allows interoperation. 

We have defined a runtime format for QoS specifi- 
cations and a runtime library to construct and manage 
such specifications. We refer to this format and library 
as a QoS fabric, and we call our QoS fabric QRR (QoS 
runtime representation). The runtime format is defined 
in terms of CORBA IDL types, and the runtime library 
is programmed in C++. However, QRR is not inherently
tied to C++ or CORBA IDL; we could also implement
the QRR fabric in Java and other languages, and on
DCOM and other distributed object infrastructures. We
could even represent QML instances as XML documents. 

Instantiating QRR specifications by hand as IDL- 
defined data structures is tedious. To allow QoS specifi- 
cations to be written at a higher level, we have defined 
QML (QoS modeling language) and a QML to QRR 
compiler. To make it practical, we have integrated QML 
with existing technologies for distributed systems, such 
as interface definition languages, and design languages 
(UML). 

The paper is organized as follows. Section 2 introduces
QML and its underlying concepts for QoS specification.
We then describe QRR in Section 3.1. We
outline how we represent the QML concepts in terms of
C++ and CORBA IDL, and we show the architecture of
the QRR QoS fabric. We illustrate how to use QRR to
implement distributed object systems with predictable 
QoS in Section 4. Finally, we give a brief overview of 
related work in Section 5, and we conclude in Section 6. 


2. QML: A Language for QoS Specification 


QML is a general-purpose QoS specification language; 
it is not tied to any particular domain, such as real-time 
or multi-media systems, or to any particular QoS cate- 
gory, such as reliability or performance. QML captures 
the fundamental concepts involved in the specification 
of QoS properties. Here, we give a brief introduction to 
these fundamental concepts. For a complete QML language
definition, including formal syntax and semantics,
consult [5].

QML has three main abstraction mechanisms for QoS 
specification: contract type, contract, and profile. A contract
type represents a specific QoS category, such as performance
or reliability. Contract types are user-defined
abstractions; there are no built-in contract types in QML.
A contract type defines the dimensions that can be used 
to characterize a particular QoS category. A dimension 
has a domain of values that may be ordered. There are 
three kinds of domains: set domains, enumerated do- 
mains, and numeric domains. A contract is an instance 
of a contract type and represents a particular QoS spec- 
ification. Finally, QML profiles associate contracts with 
interface entities, such as operations, operation argu- 
ments, and operation results. 

We use the currency trading example from the intro- 
duction to illustrate the QML specification mechanisms. 
We show how to specify QoS properties for a rate ser- 
vice object. Figure 3 gives a CORBA IDL [14] interface 
definition for a rate service object. It provides an oper- 
ation, called latest, for retrieving the latest exchange 
rates with respect to two currencies. It also provides an 
operation, called analysis, that returns a forecast for 
a specified currency. The interface definition specifies 
the syntactic signature for a service but does not specify 
any semantics or non-functional aspects. Using QML, 
we can specify the QoS properties for this interface. 

The QML definitions in Figure 4 include two con- 
tract types Reliability and Performance. The 
Reliability contract type defines three numeric dimensions.
The first dimension (numberOfFailures) represents
the number of failures per year. The keyword
“decreasing” indicates that a smaller number of fail- 
ures is better than a larger one. We use this dimension 
semantics to compare specifications under a “stronger 
than,” or conformance, relation. Time-to-repair (TTR) 
represents the time it takes to repair a service that 
has failed. Again, smaller values are better than larger 
ones. Finally, the dimension called availability repre- 
sents the probability that a service is available. For the 
availability dimension, larger values are better than 
smaller values. 

interface RateServiceI {
    Rates latest(in Currency c1, in Currency c2)
        raises (InvalidC);

    Forecast analysis(in Currency c) raises (Failed);
};


FIG. 3. The RateServiceI interface described in CORBA IDL


type Reliability = contract {
    numberOfFailures: decreasing numeric no/year;
    TTR: decreasing numeric sec;
    availability: increasing numeric;
};

type Performance = contract {
    delay: decreasing numeric msec;
    throughput: increasing numeric mb/sec;
};

systemReliability = Reliability contract {
    numberOfFailures < 10 no/year;
    TTR {
        percentile 100 < 2000;
        mean < 500;
        variance < 0.3
    };
    availability > 0.8;
};

rateServerProfile for RateServiceI = profile {
    require systemReliability;
    from latest require Performance contract {
        delay {
            percentile 80 < 20 msec;
            percentile 100 < 40 msec;
            mean < 15 msec
        };
    };
    from analysis require Performance contract {
        delay < 4000 msec
    };
};


FIG. 4. Contracts and Profile for RateServiceI


In Figure 4 we also define a contract called
systemReliability of type Reliability. The contract
specifies constraints over the dimensions defined in the
Reliability contract type. The first constraint specifies
an upper bound for the number of failures. The
second constraint applies to the TTR dimension. This
constraint uses statistical properties, such as mean, variance,
and percentiles, to characterize QoS along the TTR
dimension. In QML, we refer to such statistical properties
as dimension aspects. The aspect “percentile 100
< 2000” states that the 100th percentile must be less
than 2000.
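
To make the dimension aspects concrete, the following C++ sketch checks a set of measured TTR samples against the three TTR aspects of systemReliability. It is an illustration under stated assumptions, not part of QML or QRR: the function names are hypothetical, and the nearest-rank percentile and population variance are assumed definitions, since the text does not say which estimators QML prescribes.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Arithmetic mean of the samples (samples must be non-empty).
double mean(const std::vector<double>& xs) {
    assert(!xs.empty());
    return std::accumulate(xs.begin(), xs.end(), 0.0) / xs.size();
}

// Population variance of the samples (assumed estimator).
double variance(const std::vector<double>& xs) {
    const double m = mean(xs);
    double sum = 0.0;
    for (double x : xs) sum += (x - m) * (x - m);
    return sum / xs.size();
}

// Nearest-rank percentile (assumed definition): the smallest sample such
// that at least p percent of the samples are less than or equal to it.
double percentile(std::vector<double> xs, double p) {
    assert(!xs.empty() && p > 0.0 && p <= 100.0);
    std::sort(xs.begin(), xs.end());
    const auto rank = static_cast<std::size_t>(std::ceil(p / 100.0 * xs.size()));
    return xs[rank - 1];
}

// Checks the three TTR aspects of systemReliability from Figure 4.
bool satisfiesTtrAspects(const std::vector<double>& ttr) {
    return percentile(ttr, 100) < 2000 && mean(ttr) < 500 && variance(ttr) < 0.3;
}
```

Under these definitions, "percentile 100 < 2000" is simply a bound on the largest observed repair time.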

The profile rateServerProfile associates contracts
with entities in the rateServiceI interface. The
first requirement clause in the profile states that
the service should satisfy the previously defined
systemReliability contract. Since the clause does not
refer to any particular operation, it is considered a default
requirement that applies to every operation within
the rateServiceI interface. Being part of a default
requirement, the systemReliability contract is called a
default contract for the profile. Contracts for individual
operations are allowed only to strengthen (refine) the
default contract. In the rateServerProfile there is no
default performance contract; instead we associate individual
performance contracts with the two operations of
the RateServiceI interface. For latest we specify in
detail the distribution of delays in percentiles, as well
as an upper bound on the mean delay. For analysis
we specify only an upper bound and can therefore use
a slightly simpler syntactic construction for the expression.
Since throughput is omitted for both operations,
there are no requirements or guarantees with respect to
this dimension.
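
The relationship between default contracts and per-operation contracts can be sketched as a lookup. The C++ below is a simplified illustration only: contracts are reduced to names, the Profile type and contractsFor function are hypothetical, and the names latestPerformance and analysisPerformance are invented here because the per-operation contracts in Figure 4 are anonymous.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified profile: contracts are referred to by name only.
struct Profile {
    std::string interfaceName;
    std::vector<std::string> defaultContracts;                     // apply to every operation
    std::map<std::string, std::vector<std::string>> perOperation;  // refinements
};

// Returns the contracts that apply to one operation: the defaults first,
// followed by any contracts attached to that operation.
std::vector<std::string> contractsFor(const Profile& p, const std::string& op) {
    std::vector<std::string> result = p.defaultContracts;
    const auto it = p.perOperation.find(op);
    if (it != p.perOperation.end()) {
        result.insert(result.end(), it->second.begin(), it->second.end());
    }
    return result;
}
```

An operation with no entry of its own is still covered by the default contracts, which mirrors the default requirement described above.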

We have now specified example reliability and performance
requirements for the rateServiceI interface.
Although the rateServerProfile is specified in terms
of an interface (rateServiceI), it characterizes the QoS
of a particular implementation of this interface. We can 
specify multiple profiles for the same interface, and use 
distinct profiles for different implementations. The key 
to this flexibility is that QoS specifications are not em- 
bedded within an interface, but defined as separate en- 
tities. 

Intuitively we would say that the constraint “delay < 
10” is stronger than the constraint “delay < 20.” This 
relationship between the two constraints is due to the 
fact that delay is a decreasing dimension (smaller val- 
ues are better) and the fact that the value 10 is smaller 
than the value 20. In QML, we formalize this notion of 
“stronger than” for constraints and define a general con- 
formance relation over constraints. Stronger constraints 
conform to weaker constraints. We then use this con- 
formance relation on constraints to define conformance 
relations on contracts and profiles. Conformance is an 
important aspect of QoS specifications because it en- 
ables us to compare specifications based on constraint 
satisfaction rather than exact match. As we show in 
Section 4, conformance is essential for implementing a 
QoS-based trader: the QoS-based trader should select
any service whose QoS properties conform to the client’s
requirements; it should not select only the services
whose properties are identical to the client’s
requirements.
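
The "stronger than" relation and its use by a trader can be sketched as follows. This is a minimal illustration for simple bound constraints only, assuming that equal bounds conform; the types Constraint and Offer and the functions conformsTo and selectService are hypothetical and are not the QRR conformance checking function.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

enum class Direction { Increasing, Decreasing };

// A simple bound on one dimension, e.g. "delay < 20". For a decreasing
// dimension the bound is an upper bound; for an increasing dimension it is
// a lower bound.
struct Constraint {
    std::string dimension;
    Direction direction;
    double bound;
};

// True if `stronger` conforms to `weaker`: same dimension, and the bound of
// `stronger` is at least as demanding as the bound of `weaker`.
bool conformsTo(const Constraint& stronger, const Constraint& weaker) {
    if (stronger.dimension != weaker.dimension ||
        stronger.direction != weaker.direction) {
        return false;
    }
    return stronger.direction == Direction::Decreasing
               ? stronger.bound <= weaker.bound
               : stronger.bound >= weaker.bound;
}

struct Offer {
    std::string service;
    Constraint guarantee;
};

// Hypothetical trader step: return the first service whose guarantee conforms
// to the client's requirement, rather than requiring an exact match.
std::optional<std::string> selectService(const std::vector<Offer>& offers,
                                         const Constraint& requirement) {
    for (const Offer& o : offers) {
        if (conformsTo(o.guarantee, requirement)) return o.service;
    }
    return std::nullopt;
}
```

A client requiring "delay < 20" would thus accept a service offering "delay < 10", even though the two specifications are not identical.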

QoS specifications can be used in many different situ- 
ations. They can be used during the design of a system 
to understand and document the QoS requirements that 
must be imposed on individual components to enable the 
system as a whole to meet its QoS goals. In [6] we show 
how to use QML at design time. The focus of this paper 
is the use of QoS specifications as first-class entities at 
runtime. 


3. QRR: A QML-Based QoS Fabric 


Our QoS fabric, QRR, is based on the following re- 
quirements: 


1. QRR should support the same fundamental con- 
cepts as QML. We want to use the same QoS specifi- 
cation concepts during design and implementation. 
Using the same concepts implies that QML’s precise, 
formal definition [5] carries over to QRR. A precise 
definition improves the interoperability of different 
QRR/QML components. 

2. Since some QoS requirements may not be known 
until runtime, it should be possible to dynamically 
create new QRR specifications. Rather than use dy- 
namic compilation for such specifications, we want 
to call generic creation functions in the QRR library. 

3. It should be possible to explicitly check consistency 
of dynamically created specifications against the 
static semantic rules of QML/QRR. The QML com- 
piler checks the rules for compiled specifications. We 
need a library function that checks the rules for dy- 
namically created specifications. 

4. Once created, there should be ways to manipulate 
QRR specifications. For example, a QoS offering 
by a server may have to be adjusted relative to the 
current execution environment to accurately reflect 
what QoS the client will actually receive. 

5. QRR should impose a minimal overhead and be scal- 
able. 

6. QRR should provide a minimal set of generic build- 
ing blocks for runtime QoS specification. In partic- 
ular, QRR specifications should be independent of 
the mechanisms and applications that use them. For 
example, in negotiation, as well as trading, we are 
interested in agreements between parties involving 
commitments from both sides. Thus we are dealing 
with structures consisting of pairs of QoS specifica- 
tions (one for each party). Rather than provide an 
agreement abstraction in QRR, we only provide the 
basic building blocks that represent QoS specifica- 
tions. It is then up to the mechanisms and appli- 
cations to use these basic building blocks to create 
composite structures. 
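
Requirements 2 and 3 can be illustrated with a small C++ sketch of dynamic creation followed by an explicit consistency check. Everything here is hypothetical: the types, the add and isConsistent functions, and in particular the two rules being checked, which are assumed for illustration and are not QML's actual static semantic rules.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

enum class Direction { Increasing, Decreasing };
enum class Op { Less, Greater };

// Hypothetical runtime description of a contract type: dimension -> direction.
using ContractType = std::map<std::string, Direction>;

struct Constraint {
    std::string dimension;
    Op op;
    double value;
};

// A contract built at runtime through plain function calls, not compilation.
struct Contract {
    std::vector<Constraint> constraints;
    Contract& add(const std::string& dim, Op op, double value) {
        constraints.push_back({dim, op, value});
        return *this;
    }
};

// Illustrative check with an assumed rule set: every constraint must name a
// declared dimension, and must bound it from the undesirable side (upper
// bound for decreasing dimensions, lower bound for increasing ones).
bool isConsistent(const Contract& c, const ContractType& type) {
    for (const Constraint& k : c.constraints) {
        const auto it = type.find(k.dimension);
        if (it == type.end()) return false;
        const Op expected =
            it->second == Direction::Decreasing ? Op::Less : Op::Greater;
        if (k.op != expected) return false;
    }
    return true;
}
```

Keeping the check as a separate library call lets an application build a specification incrementally and validate it once, just before it is communicated.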


We are implementing QRR to satisfy these require- 
ments. Currently, we have implemented a prototype 
QML compiler and a prototype QRR library. We have 
successfully compiled QML specifications into QRR, in- 
stantiated those specifications in a CORBA environ- 
ment, communicated the specifications between dis- 
tributed components, and compared them using a con- 
formance checking function that is part of the QRR li- 
brary. 


3.1 Implementing QRR 


The QRR implementation contains a generic C++ 
library that allows applications to create QRR specifica- 
tions and to check conformance of these specifications. 
This library is linked into applications that use QRR. 
The library defines a number of data types that are used 
to represent QRR specifications in C++. These data 
types are generated from CORBA IDL type definitions
to facilitate the communication of QRR specifications 
between distributed CORBA objects. 

In addition to the generic library, the implementation 
also contains a QML to QRR compiler. This compiler 
emits a mix of IDL and C++ code to represent a particu- 
lar QML specification. The emitted IDL code consists of 
types that represent that QML specification. The C++ 
code contains functions to create QRR instances of the 
QML specification. The emitted IDL code is translated 
into C++ using a conventional CORBA IDL compiler. 


3.2 Representation 


We describe how to represent QML constructs in 
terms of CORBA IDL and C++. Profiles are repre- 
sented as instances of the profile struct shown in Fig- 
ure 5. They contain the profile name, the interface name, 
a sequence of default contracts (dcontracts), and a se- 
quence (profs) of structs, each associating an interface 
entity with a set of contracts. The profs sequence rep- 
resents the individual contracts of the profile. In QRR, 
all profiles are instances of the profile struct. For a 
particular profile specified in QML, the QML compiler 
emits a C++ function that constructs an instance of the 
profile struct. The C++ function also constructs and 
assigns appropriate data structures to the fields of this 
struct. 


struct profile {
    string pname;
    string iname;
    contractSeq dcontracts;
    entityProfileSeq profs;
};


FIG. 5. IDL for profile


QoS constraints are represented as instances of the
struct constraint in Figure 7. A constraint struct
has a sequence of aspect structs, as well as a tag
indicating whether it is a simple constraint (such as
“delay < 10”) or a set of aspects representing statistical
characterizations. We define a separate struct type
for each aspect kind; however, the figure only shows the
struct used to represent mean aspects. Because IDL does
not allow polymorphism for structs, we cannot directly
reflect the relationship between the general notion of
aspects, captured by the aspect struct, and particular
aspect types such as mean. Instead of defining particular
aspect types as subtypes of aspect, we define an any
field in instances of aspect that contains a particular
aspect instance. We also define a type tag in instances
of aspect that indicates which particular aspect type
has been wrapped in the any field.
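
The tag-plus-any encoding can be mimicked in plain C++ as a rough sketch. Nothing below is taken from the QRR implementation: std::any (C++17) stands in for the CORBA any, the operator values are assumed, and the helpers wrapMean and meanOf are hypothetical names chosen for illustration.

```cpp
#include <any>
#include <stdexcept>

// Hypothetical stand-ins for the IDL types of Figure 7.
enum aspectKind { ak_freq, ak_perc, ak_mean, ak_var, ak_simple };
enum operators { op_lt, op_le, op_eq, op_ge, op_gt };  // assumed operator set

// One particular aspect type.
struct mean {
    operators op;
    double num;
};

// The general aspect: a kind tag plus an untyped payload.
struct aspect {
    aspectKind ak;
    std::any asp;  // plays the role of the CORBA any field
};

// Wrap a mean aspect and record its kind in the tag.
inline aspect wrapMean(operators op, double num) {
    return aspect{ak_mean, mean{op, num}};
}

// Recover the mean aspect; the tag says which type to extract.
inline mean meanOf(const aspect &a) {
    if (a.ak != ak_mean) {
        throw std::invalid_argument("aspect does not hold a mean");
    }
    return std::any_cast<mean>(a.asp);
}
```

The tag is checked before the payload is unwrapped, which is the job the type tag does in place of struct subtyping.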

We provide two alternative representations for con- 
tracts and contract types. In the generic representation, 
all contracts are instances of the same type, and this type 
is then part of the QRR library. In the static representa- 
tion, only contracts of the same QML contract type are 
instances of the same QRR type. In addition, the QRR 
types used for the static representation are emitted by 
the QML compiler. 

The static representation requires that the emitted 
QRR types are linked into the application that instan- 
tiates them. On the other hand, the static representa- 
tion facilitates a more efficient implementation of con- 
formance checking and other QRR functions. 

With the generic representation, applications can 
dynamically create and communicate contracts whose 
types are not known at compile time. However, manip- 
ulation and analysis of contracts is less efficient in the 
generic representation because the structure of contracts 
must be discovered dynamically. 

Although we describe them as separate representa- 
tions, our goal is to allow their simultaneous use to 
achieve maximum flexibility and performance. Since our 
current implementation only supports the static repre- 
sentation, we only give a brief overview of the generic 
representation and concentrate primarily on the static 
representation. 
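
The trade-off between the two representations can be illustrated with a small sketch. All names here (bound, staticReliability, genericContract, lookupDimension) are hypothetical and much simpler than the QRR types; the point is only that static access is a checked field read, while generic access is a run-time lookup that can fail.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for a QRR constraint: a single numeric bound.
struct bound {
    double value;
};

// Static style: one struct per contract type, dimensions are named fields.
struct staticReliability {
    bound availability;
    bound TTR;
};

// Generic style: one struct for all contract types; dimensions are
// discovered by name at run time.
struct genericContract {
    std::string typeId;
    std::map<std::string, bound> dims;
};

// Static access is a direct field read that the compiler can check.
inline double availabilityOf(const staticReliability &c) {
    return c.availability.value;
}

// Generic access is a lookup that fails if the dimension is absent.
inline bool lookupDimension(const genericContract &c, const std::string &name,
                            double &out) {
    const auto it = c.dims.find(name);
    if (it == c.dims.end()) {
        return false;
    }
    out = it->second.value;
    return true;
}
```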


enum aspectKind {
    ak_freq, ak_perc,
    ak_mean, ak_var,
    ak_simple
};

struct mean {
    operators op;
    value num;
};

struct aspect {
    aspectKind ak;
    any asp;
};

typedef sequence <aspect> aspects;
enum constrKind { ck_simple, ck_stat };
struct constraint {
    constrKind ck;
    aspects asps;
};


FIG. 7. IDL for aspect and constraint

The IDL definitions in Figure 8 describe some elements
of the generic contract representation. In the
generic representation, all contracts are instances of the 
struct called contract. A contract’s dimensions are then 
represented as a sequence of structs of type dimension. 
A contract has a type identifier (tid) that refers to its 
contract type. Contract types are built from various 
generic structs that capture all the information about 
domains and their ordering. These type representations 
are quite elaborate as they contain information about 
all values, how these values are ordered, and whether 
the dimension is increasing or decreasing. Due to space 
constraints we do not describe these type structures in 
detail in this paper. 

With the static representation, the QML compiler will
map a QML contract type into a number of C++ classes
and an IDL struct. The C++ classes represent the con- 
tract type itself. They contain information about the 
domain elements and the domain ordering for the di- 
mensions defined in the contract type. The IDL struct 
is used to represent contracts that are instances of the 
contract type. An instance of the emitted IDL struct 
represents a particular contract. 


FIG. 6. Class diagram for contract type representation.


struct dimension {
    string name;
    constraint constr;
};

struct contract {
    tid ct;
    sequence<dimension> dims;
};


FIG. 8. IDL for generic contracts





The emitted C++ classes inherit from, and add to,
a set of contract type base classes implemented in the
QRR library. Figure 6 shows, using UML [3] notation,
a simplified view of the C++ classes for contract types
in the static representation. Emitted classes are grayed
and classes defined in the QRR library are white.

Sub-classes of contractType represent the emitted 
classes for specific contract types. These classes con- 
tain data members that represent the dimensions in the 
contract type. They also contain a conformance check- 
ing function. Since the conformance checking function 
is emitted on a per-contract type basis, it can directly 
refer to the type’s dimensions as data members. 

If the contract types contain set or enumeration do- 
mains that are ordered, we also emit C++ classes that 
represent these domains. The domain classes are sub- 
classes of the library class called domain. The main role 
of emitted domain classes is to provide information about 
the domain ordering. 
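
As a rough illustration of what "information about the domain ordering" could look like, consider the sketch below. The base class name domain comes from the text, but its interface is not described in this paper; the member atLeastAsStrong, the class orderedEnumDomain, and the increasing flag are assumptions made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical library base class: answers ordering questions for a domain.
class domain {
public:
    virtual ~domain() = default;
    // True if value a is at least as strong as value b in this domain.
    virtual bool atLeastAsStrong(const std::string &a,
                                 const std::string &b) const = 0;
};

// Hypothetical emitted class for an ordered enumeration domain.
// Elements are listed in their declared order; for an increasing
// dimension later elements are stronger, for a decreasing one weaker.
class orderedEnumDomain : public domain {
public:
    orderedEnumDomain(std::vector<std::string> elems, bool increasing)
        : elems_(std::move(elems)), increasing_(increasing) {}

    bool atLeastAsStrong(const std::string &a,
                         const std::string &b) const override {
        const std::size_t ra = rank(a);
        const std::size_t rb = rank(b);
        return increasing_ ? ra >= rb : ra <= rb;
    }

private:
    // Position of an element in the declared order
    // (elems_.size() if the element is not in the domain).
    std::size_t rank(const std::string &e) const {
        return static_cast<std::size_t>(
            std::find(elems_.begin(), elems_.end(), e) - elems_.begin());
    }

    std::vector<std::string> elems_;
    bool increasing_;
};
```

For an increasing enumeration declared in the order invalid, valid, the value valid is at least as strong as invalid, but not the other way around.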

In addition to the C++ classes, the compiler also
emits an IDL struct definition for each contract type.
The name of this struct is the contract type name with _i
appended to it. The struct has one field for each dimension.
Each field has the same name as the corresponding
dimension and is of type constraint.

In Figure 10 we show a QML contract type called 
Reliability and the corresponding emitted IDL struct. 

An instance of the Reliability_i struct will hold 
instances of constraints that in turn hold the aspects 
specified for each individual constraint. An instance also 
contains the type identifier of its contract type. 

Given a type definition, such as Reliability_i, the
programmer could in principle create contracts at run- 
time by instantiating the type and building up appro- 
priate structures for the fields in the struct. However, 
this is tedious. To automate the instantiation process, 
the QML compiler emits instantiation functions for each 
contract and profile declared in QML. 

To give a concrete, but simple, example of how these
instantiation functions are constructed, we provide a
contract in Figure 11 and the corresponding emitted
construction function in Figure 13. Running these simple
definitions through the QML to QRR compiler will
produce the static representation, which is a struct
named T_i. The compiler also emits the C++ class T,
which describes the contract type and implements
conformance checking. In addition, it produces a function
(shown in Figure 13) with the same name as the specified
contract (in this case C). When C is called, it returns
an instance of T_i representing the constraints specified
in C. The C function uses the same lower-level functions
that are provided to applications that manually compose
contracts and profiles. Similarly, we produce functions
for profiles that build up the corresponding QRR structures.


3.3 Library Functions


When an application needs to check conformance, it
invokes the library function conformsTo, whose signature
is shown in Figure 12. This function takes two
profiles and checks conformance between their contracts.
Inside profiles, contracts are stored as a pair
consisting of a contract type name and an element of
type any. For a performance contract, the any element
will contain an instance of type Performance_i
and the contract type name will be the string “Performance”.
To check conformance between performance
contracts, conformsTo will use the string “Performance”
to look up the C++ object that represents
performance contract types at runtime. This object
is of type Performance and will have a virtual function
called conformsTo (the signature of this function is
given in Figure 12 as Performance::conformsTo). The
Performance::conformsTo function is emitted. It expects
two any arguments that both contain instances
of the struct Performance_i. Since it is emitted, the
Performance::conformsTo function knows which objects
to extract from the any arguments.
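
The lookup-then-dispatch scheme can be sketched in standalone C++. This is an illustration under stated assumptions, not the QRR library: std::any (C++17) replaces CORBA::Any, the registry type and the function contractConformsTo are hypothetical, and Performance_i is reduced to a single assumed decreasing dimension (maxDelay).

```cpp
#include <any>
#include <map>
#include <string>

// Hypothetical base class for runtime contract type objects.
class contractType {
public:
    virtual ~contractType() = default;
    // Both arguments are expected to wrap the struct emitted for this type.
    virtual bool conformsTo(const std::any &stronger,
                            const std::any &weaker) const = 0;
};

// Hypothetical emitted struct: one decreasing dimension
// (a smaller delay bound is the stronger guarantee).
struct Performance_i {
    double maxDelay;
};

// Hypothetical emitted class for the contract type "Performance".
class Performance : public contractType {
public:
    bool conformsTo(const std::any &stronger,
                    const std::any &weaker) const override {
        // Emitted code knows which concrete type to extract.
        const auto &s = std::any_cast<const Performance_i &>(stronger);
        const auto &w = std::any_cast<const Performance_i &>(weaker);
        return s.maxDelay <= w.maxDelay;
    }
};

// Registry from contract type name to its runtime object.
using typeRegistry = std::map<std::string, const contractType *>;

// Generic entry point: look the type up by name, then dispatch virtually.
inline bool contractConformsTo(const typeRegistry &reg,
                               const std::string &typeName,
                               const std::any &stronger,
                               const std::any &weaker) {
    const auto it = reg.find(typeName);
    if (it == reg.end()) {
        return false;  // unknown contract type: conformance cannot be shown
    }
    return it->second->conformsTo(stronger, weaker);
}
```

Only the emitted class touches the concrete struct; the generic entry point works purely with names and untyped values.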


type Reliability = contract {
    numberOfFailures: decreasing numeric;
    TTR: decreasing numeric;
    availability: increasing numeric;
};

struct Reliability_i {
    tid ct;
    constraint numberOfFailures;
    constraint TTR;
    constraint availability;
};


FIG. 10. IDL for statically generated contracts


type T = contract {
    l : increasing numeric msec;
    s : enum {initial, amnesia,
              noguarantee, rolledback};
};

C = T contract {
    l {
        percentile 40 < 50;
        mean < 20
    };
    s == amnesia;
};


FIG. 11. A simple contract type and contract








int conformsTo(profile &stronger, profile &weaker);

int Performance::conformsTo(CORBA::Any *stronger,
                            CORBA::Any *weaker);

int checkSem(const profile &p);


FIG. 12. Some library function signatures


The programming model also provides a variety of 
convenience functions for creating aspects, contracts and 
other QRR/QML constructs. 


3.4 Programming Model


To give the reader a better feel for the program- 
ming model offered by QRR, we describe a simple QoS 
compatibility-checking mechanism that allows a client to 
send its QoS requirements in the form of a QRR profile
to a server. The server checks whether it can satisfy the 
client’s requirements. 


interface QoSAware {
    exception invalidProfile{};

    boolean compatible(in profile p)
        raises (invalidProfile);
};


FIG. 9. QoSAware interface


To support the QoS-checking mechanism, the server
implements the interface QoSAware, which we describe
in Figure 9. The operation compatible allows the client
to send the profile it requires to the server. The server
responds with true if the client’s requirements and the
server’s capabilities are compatible, and with false otherwise.
If the profile is semantically invalid, the operation
raises an exception.


T_i * C() {
    T_i * _C; _C = new T_i;
    _C->ct = CORBA::string_dup("T");
    //Create aspects for l
    aspect * _l;
    _C->l.asps.length(2);
    _C->l.ck = ck_stat;
    _l = qml_perc_asp(le, 40, (float) 50);
    _C->l.asps[0] = *_l;
    delete _l;
    _l = qml_mean_asp(le, (float) 20);
    _C->l.asps[1] = *_l;
    delete _l;
    //Create simple for s;
    aspect * _s;
    _C->s.asps.length(1);
    _C->s.ck = ck_simple;
    _s = qml_simp_constr(eq, (float) 2 /*amnesia*/);
    _C->s.asps[0] = *_s;
    delete _s;
    return _C;
}


FIG. 13. Emitted function that creates a T contract


CORBA::Environment env;
profile * p = i2_prof();
if (I2ref->compatible(*p, env)) {
    //OK to use this server
} else {
    //use another server
}


FIG. 14. Client call

To make the QoS checking more concrete, let us assume
that a server A provides an interface I1 and uses
a server B that implements an interface I2. We can
describe, in QML, the requirements of server A on server
B as a profile for the interface I2. We can also describe
the QoS provided by A as a profile for interface I1. Having
defined those profiles and the contracts that they use,
we can emit QRR code that can be compiled and linked
with both servers. Notice that server A plays the role of
client relative to server B.

We can create the specified profiles in server A by
invoking the emitted functions that have the same names
as the profiles specified in QML. If we have a profile
named i2_prof specifying A’s requirements on I2, A
objects would use an emitted function called i2_prof to
create a QRR instance of this profile. The C++ code
in Figure 14 illustrates how a profile can be created and
sent with an ordinary CORBA request.


CORBA::Boolean B_serverImpl::compatible(
    const profile &p,
    CORBA::Environment &_ev)
{
    if (!checkSem(p)) {
        throw QoSAware::invalidProfile();
    }

    if (conformsTo(myprof(), p)) {
        cout << "Conformance... " << endl;
        return 1;
    }
    else {
        cout << "Non-conformance..." << endl;
        return 0;
    }
}


FIG. 15. Server implementation

The implementation of compatible simply takes the
profile specified for the server and checks its conformance 
to the profile supplied by the client. We assume that 
the server obtains its own profile by invoking a function
called myprof. The implementation of compatible
checks the static semantics of the profile before doing
conformance checking. In the future we intend to include
information in a profile that allows a program to
determine whether a profile has already been checked for 
semantic validity or not. With this extra information, 
we can avoid redundant semantic checks. Figure 15 de- 
scribes a simple implementation of a server that supports 
the QoSAware interface. 
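
The check-then-conform flow, together with the proposed "already checked" marker, might look as follows in a standalone sketch. This is not the QRR API: miniProfile, its semanticsChecked flag, and the one-dimensional checkSem and conformsTo below are hypothetical simplifications that only show where a cached validity flag would save a redundant check.

```cpp
#include <stdexcept>

// Hypothetical, much-reduced profile: a single delay bound plus a flag
// remembering a successful semantic check.
struct miniProfile {
    double maxDelayMs;
    bool semanticsChecked;
};

// Stand-in for the invalidProfile exception of the QoSAware interface.
struct invalidProfile : std::runtime_error {
    invalidProfile() : std::runtime_error("semantically invalid profile") {}
};

// Stand-in for checkSem: here a profile is valid if its bound is positive.
inline bool checkSem(const miniProfile &p) {
    return p.maxDelayMs > 0.0;
}

// Stand-in for conformsTo on one dimension (a smaller bound is stronger).
inline bool conformsTo(const miniProfile &stronger,
                       const miniProfile &weaker) {
    return stronger.maxDelayMs <= weaker.maxDelayMs;
}

// Validate first, skipping profiles already marked as checked,
// then test conformance of the server profile to the client profile.
inline bool compatible(const miniProfile &server, miniProfile &client) {
    if (!client.semanticsChecked) {
        if (!checkSem(client)) {
            throw invalidProfile();
        }
        client.semanticsChecked = true;
    }
    return conformsTo(server, client);
}
```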


3.5 Discussion 


In the generic mapping, contracts can be created, em- 
bedded in profiles, and communicated even if they are 
not statically known. If we use this mapping, we would 
initially send contract type descriptions for all contract 
types to be used during a session. This would ensure 
that each participating object has all contract type de- 
scriptions available. Using the static mapping, on the 
other hand, we require that all contract types are known 
statically and compiled into the participating objects. If 
a received contract is an instance of an unknown type, 
the receiver can do little more than raise an exception.

We could require that it is decided up front whether
the generic or static mapping will be used during a session.
Strictly separating the generic and static
mappings, and using only one of them during a particular
session, would simplify the implementation, but it would
also be quite inflexible. In the case of the generic mapping,
it would also impose unnecessary performance overhead
when contract types are already known.

Global consistency of types is another issue that we
are considering. It is necessary to have a mechanism
for identifying and comparing types to determine if they
are the same. The current implementation uses an
oversimplified approach based on text strings. Our plans
for future work include leveraging previous solutions,
such as the handling of types in CORBA, to resolve this
issue in QRR.


4. Example: A QoS-Based Trader


This section illustrates how QRR can be used to con- 
struct higher-level QoS components. We show how to 
build a QoS-based trader, and explain its utility in the 
context of the currency trading example from the intro- 
duction section (do not confuse a QoS-based trader with 
the currency trading service in the currency trading sys- 
tem). 

The purpose of this section is not to discuss or advocate
the merits of QoS-based trading, but to show concretely
the expressive power and flexibility of QRR. The
constructs of QRR make it relatively straightforward to
implement QoS-aware components that would otherwise 
be complicated to build. For another example, see [10] 
in which a QoS negotiation mechanism is presented. The 
mechanism uses QML and QRR as its underlying mech- 
anisms for exchange of QoS specifications. 


4.1 Trading in Distributed Systems 


A conventional trader [15] in distributed systems facil- 
itates the binding between clients and services. Services 
register with the trader, and clients query the trader to 
find a service that satisfies certain criteria. We show an 
example interface to a very simple conventional trader 
in Figure 16. 

Services register themselves by calling the offer
method on the trader. A service passes a description
of its properties (the properties that can be used for
service selection) and a reference to itself. The trader
returns an offer identifier to the service. The service
can use this identifier to later withdraw its offer by
invoking the withdraw method. Clients call the methods
find and findAll to obtain service references. The find
method returns a single service reference. The found service
satisfies the criteria passed in as a parameter to the
find call. The findAll method returns a list of service
references; it returns all the services that match the
criteria passed in as a parameter.

In general, a trader matches criteria passed in by 
clients against properties passed in by services. Conven- 
tional trader services typically use name-value pairs for 
server selection. Services pass in one such attribute list 
when they register, and clients pass in an attribute list 
that is matched against such service attribute lists. Typ- 
ically, matching is based on name-value equality, with 
the provision that the client’s attribute list must be equal 
to a sub-list of the server’s attribute list. 
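
A minimal sketch of this kind of equality-based matching is shown below. It treats attribute lists as unordered name-value maps, which is an assumption; the type attributeList and the function matches are hypothetical names, not part of any trader specification.

```cpp
#include <map>
#include <string>

// Hypothetical attribute list: name-value pairs used for selection.
using attributeList = std::map<std::string, std::string>;

// A service matches when every client attribute appears in the service's
// list with an equal value; extra service attributes are ignored.
inline bool matches(const attributeList &client,
                    const attributeList &service) {
    for (const auto &entry : client) {
        const auto it = service.find(entry.first);
        if (it == service.end() || it->second != entry.second) {
            return false;
        }
    }
    return true;
}
```

With this scheme a client can only ask for exact values; it cannot express "at least" or "at most".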


4.2 QoS-Based Trading


The main idea behind QoS-based trading is to use
QoS specifications as service properties and client criteria,
and to use specification conformance to perform
the criteria-to-properties matching. We show that with
QRR it is relatively straightforward to implement a
QoS-based trader.


interface Trader {
    OfferId offer(in ServiceProperties sp,
                  in Object obj) raises (invalidOffer);

    Match find(in Criteria cr) raises (noMatch);
    MatchSeq findAll(in Criteria cr) raises (noMatch);
    void withdraw(in OfferId o) raises (noMatch);
};


FIG. 16. The interface of a simple conventional trader

We want to use the QoS-based trader to establish QoS 
agreements between clients and services in distributed 
object systems. A QoS agreement is a contract between 
a client and a service. In our discussion so far, we have 
talked about client requirements and service properties. 
This is a somewhat simplified picture. The service prop- 
erties may depend on the way in which the client uses 
the service. For example, the throughput that a service 
can provide may depend on how frequently a client calls 
the service. So, for performance, a QoS contract may 
involve requirements and properties for both clients and 
services. 

Thus in general, the ServiceProperties argument 
to the offer method in a QoS-based trader will in- 
volve service properties and requirements. The service 
can provide its properties if the client satisfies the re- 
quirements. Figure 17 gives a possible structure for 
ServiceProperties using QRR. We describe the service 
requirements and properties using profiles. Figure 17 
also shows the structure of client criteria. These also 
have two profiles: one representing client requirements
and one representing client properties. 

With the conformsTo function on profiles introduced 
in Section 3.3, we can now implement the matching
procedure in the QoS-based trader. We illustrate the 
conformance-based matching procedure in Figure 18. 
The procedure iterates through the registered services 
and for each service checks whether that service satis- 
fies the criteria passed in as arguments. The function 
conformsTo takes two profiles and determines whether 
the first profile conforms to the second profile. To find a 
matching service, the find method checks whether the 
server properties conform to the client requirements, and 
it checks whether the client properties conform to the 
server requirements. 

An alternative implementation strategy would be to 
use a conventional trader and represent QoS specifica- 
tions as name-value pairs. However, we would then be 
limited by the expressive power of name-value pairs; it
is not clear how to elegantly represent the concepts of 
QML and QRR in terms of name-value pairs. More seri- 
ously, with a conventional trader we would use equality 
for server selection. It is essential that we can select a 
service even if the client’s requirements are not equal to 
the service’s properties: we want to select a service as 
long as the service’s properties satisfy, or conform to,
the client’s requirements. 
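
The difference between the two selection rules can be seen on a single numeric dimension. The sketch below is illustrative only: lowerBound, equalTo, and this one-dimensional conformsTo are hypothetical and stand for an increasing dimension such as availability.

```cpp
// Hypothetical lower-bound constraint on an increasing dimension,
// for example "availability >= 0.99".
struct lowerBound {
    double atLeast;
};

// Equality-based selection: only an identical bound matches.
inline bool equalTo(const lowerBound &offered, const lowerBound &required) {
    return offered.atLeast == required.atLeast;
}

// Conformance-based selection: any offer that guarantees at least the
// required bound matches.
inline bool conformsTo(const lowerBound &offered,
                       const lowerBound &required) {
    return offered.atLeast >= required.atLeast;
}
```

A service offering 0.99 would be rejected by equality for a client that asks for 0.9, although it clearly satisfies the request; conformance accepts it.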


4.3 Using a QoS-Based Trader


Here, we use a QoS-based trader to implement the
currency trading system from the introduction section.
Currency trading is a complex process that requires significant
information and analysis [13]. Although we
appreciate the complexity of systems supporting such
trades, we have to simplify the problem for the purpose
of this paper.


struct ServiceProperties {
    profile properties;
    profile requirements;
};

struct Criteria {
    profile properties;
    profile requirements;
};


FIG. 17. The structure of ServiceProperties and Criteria

As a reminder, our simple currency trading system 
consists of three logical components: a front-end that 
serves as a user interface, a rate service that provides in- 
formation about current exchange rates, and a currency 
trading service to execute currency trades. We will focus 
only on the rate service that provides exchange rate information.
The structure of this simple currency trading
system was shown in Figure 2.

Assume that we want to build a currency trading system
that uses a rate service available on the Internet.
Different service providers may implement different rate
services with the same basic functionality, but with different
QoS properties. For example, one rate service may
provide frequent exchange rate updates and thus higher
precision. Another rate service may provide less frequent
updates, which implies that the information will not be
as accurate. Different services may provide information
for different currencies. Moreover, one rate service may
be highly available and expensive to use, whereas another
rate service may be more unreliable but cheaper
to use. The point is that one size does not fit all: different
clients have different requirements or expectations
about the QoS delivered by a rate service, and some
clients are willing to pay for high levels of QoS, whereas
other clients are not.


// C++ Implementation sketch of the find method
// in a QoS-based trader
Match QoS_Trader::find(const Criteria &cr)
    throw (noMatch)
{
    ServiceIterator it = ...;
    for (it.init(); !it.done(); it.advance()) {
        ServiceProperties sp =
            it.currentServiceProperties();
        // server properties must conform to client requirements, and
        // client properties must conform to server requirements
        if (conformsTo(sp.properties, cr.requirements) &&
            conformsTo(cr.properties, sp.requirements)) {
            return it.currentServiceMatch();
        }
    }
    throw noMatch();
}


FIG. 18. Matching based on conformance

The QoS-based trader provides a mechanism for per- 
forming server selection in this kind of environment. The 
various rate services register themselves with the QoS- 
based trader and provide QoS specifications that reflect 
their particular notion of QoS. Clients then consult the 
QoS-based trader when they want to connect to a rate 
service. In doing so, clients communicate their QoS ex- 
pectations to the QoS-based trader to select a suitable 
service. 

Figure 19 shows the structure of the currency trading
system with a QoS-based trader. We could of course
select between multiple currency trading services as well,
but we want to simplify the example and focus on rate
services. The issues involved in selecting a currency
trading service are similar.


FIG. 19. Structure of the currency trading system with a
QoS-based trader


type DataQuality = contract {
    currencies: set { USD, JPY, SEK, FIM, DKK, DEM,
                      ITL, KGS, EEK, KYD, BWP, LUF, FRF,
                      GBP, QAR, RUR, TOP, MAD, SAR };
    updateFrequency: increasing numeric updates/min;
};

type Reliability = contract {
    availability: increasing numeric;
    referenceValidity: increasing
        enum { invalid, valid }
        with order { invalid < valid };
};

type ClientPriceBound = contract {
    costPerInvocation: decreasing numeric cent/call;
    costPerHour: decreasing numeric cent/hour;
};


FIG. 20. Contract types for the currency trading system

In the following we show how to use the concepts of 
QML and QRR to create QoS specifications for a couple 
of rate services and for a front-end. Due to space con- 
straints, we provide a somewhat simplified version of the 
various QoS specifications. 

First, we need a number of contract types to represent
the various QoS categories under consideration in
the QoS agreement between the front-end and a rate
service. Figure 20 outlines these contract types. The
first type, called DataQuality, captures the notion of
data quality: accuracy and content. The currencies
dimension reflects the currencies that a particular service
can provide information on. The updateFrequency
dimension reflects how often this rate information is updated
at the server and thus gives an indication of the precision.
The Reliability contract type captures the reliability
of the rate service. In our previous work, we identified a
number of dimensions to characterize reliability for distributed
object systems [9]. Here, we only use a very
simple characterization. We use two dimensions: availability
is the probability that the server is up when a client
tries to contact it, and reference validity states whether
the object reference to the server remains valid after
the server has crashed and come back up. Finally, the
ClientPriceBound contract type captures how much the
front-end is willing to pay for the rate service. Specifications
of this type can characterize an upper bound, or
the exact price, for the service cost as seen by the client.
An upper bound does not determine the actual price the
service charges; it merely serves as a first-order filtering
criterion when matching up clients and services. The
actual determination of how to charge for a service may
involve more sophisticated negotiation protocols that we
do not consider here. Payment can be specified both
as a per-invocation cost and a per-session cost, where
the per-session cost is determined from the length of the
session.


service1props for rateService = profile {
    require DataQuality contract {
        currencies >= { USD, JPY, SEK, FIM, DKK, DEM,
                        ITL, KGS, EEK, KYD, BWP, LUF, FRF,
                        GBP, QAR, RUR, TOP, MAD, SAR };
        updateFrequency >= 1;
    };
    require Reliability contract {
        availability >= 0.99;
        referenceValidity == valid;
    };
    require Price contract {
        costPerInvocation <= 100;
    };
};

service2props for rateService = profile {
    require DataQuality contract {
        currencies == { SEK, FIM, DKK, MAD, GBP };
        updateFrequency >= 60;
    };
    require Price contract {
        costPerInvocation <= 50;
    };
};


FIG. 21. Profiles for rate services

Given these basic contract types, we can now start to 
define the QoS properties of services. To simplify the 
example, we only specify service properties and client 
requirements. In other words, we do not consider client 
properties and server requirements. Moreover, we specify
the contracts as default contracts, that is, contracts that
apply to the rate service object rather than individual
methods in this object.

In Figure 21, we outline the QoS properties for two 
different rate services. Notice that both services imple- 
ment the RateService interface defined in Figure 3. The 
first service is characterized by the service1props pro- 
file. This service is a highly available, expensive ser- 
vice that supports many currencies with a low update 
frequency. The second service is characterized by the 
service2props profile. The second service does not pro- 
vide any reliability guarantees. On the other hand it is 
fairly cheap to use. It supports only a few currencies, but 
the frequency of updates is high. Both services can only
charge based on a per-invocation scheme; they cannot
charge for entire sessions.

We can now create profiles, using QRR, during server
initialization. One service will create a profile according
to the service1props specification, and the other service
will create a profile according to the service2props
specification. We can create the QRR profiles by calling
the profile-creation functions emitted by the QML
compiler. Alternatively, we can bypass the QML compiler
and create the profiles by calling the QRR library
functions directly. For simplicity we assume that each
service has a single profile. In a more realistic situation,
each service may have multiple profiles to handle service
differentiation and QoS variations over time.

Both services will register with the QoS-based trader, 
and when registering, they will provide their respective 
profiles. In a dynamic environment services can with- 
draw old offers that can no longer be supported and make 
new offers. 

We also need to specify the requirements of the front-end.
We give an example of such a specification in Figure 22.
According to the figure, the front-end has no reliability
requirements, is willing to pay a medium cost on
a per-invocation basis, requires rates for only Swedish
Crowns, Danish Crowns, and British Pounds, and requires
a relatively high update frequency. This specification
can either reflect the requirements of the front-end
object as such, or it can represent the requirements of
a user of the front-end object. In the first case, we can
write the profile in QML and generate profile-creation
functions based on the QML specification, since we know the
QoS requirements when we implement the front-end. In
the second case, where the requirements reflect user requirements,
we do not know the requirements until runtime,
and we cannot compile profile-creation functions
into the front-end. In this case, we have to call the
generic profile-creation functions in QRR to dynamically
create a profile that reflects the user’s requirements.

Once the front-end has created a profile that reflects
its requirements, it can then call find on the QoS-based
trader to obtain a reference to a rate service object that
satisfies those requirements. In our case, the front-end’s
requirements will give rise to selection of service number
two. The profile service2props conforms to the profile
frontEndReqs, whereas the profile service1props does
not.
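
The selection outcome can be reproduced with a simplified worked example. The sketch is not QRR and does not implement QML conformance in general: rateProfile is a hypothetical flattening of the profiles in Figures 21 and 22, reliability is omitted because the front-end states no reliability requirement, and the three rules in conformsTo (currency coverage by superset, lower bound on update frequency, upper bound on cost) are assumptions about how these particular dimensions compare.

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <vector>

// Hypothetical flattened profile with one value per dimension used here.
struct rateProfile {
    std::string name;
    std::set<std::string> currencies;  // currencies covered or required
    double updatesPerMin;              // lower bound on update frequency
    double centsPerCall;               // upper bound on per-invocation cost
};

// Assumed rules: the offer covers every required currency, promises at
// least the required update frequency, and costs no more than the bound.
inline bool conformsTo(const rateProfile &offer, const rateProfile &required) {
    return std::includes(offer.currencies.begin(), offer.currencies.end(),
                         required.currencies.begin(),
                         required.currencies.end()) &&
           offer.updatesPerMin >= required.updatesPerMin &&
           offer.centsPerCall <= required.centsPerCall;
}

// Return the name of the first conforming offer, or "" if none conforms.
inline std::string findService(const std::vector<rateProfile> &offers,
                               const rateProfile &required) {
    for (const auto &offer : offers) {
        if (conformsTo(offer, required)) {
            return offer.name;
        }
    }
    return "";
}

// Values transcribed from Figures 21 and 22.
inline std::string selectForFrontEnd() {
    const rateProfile service1{
        "service1props",
        {"USD", "JPY", "SEK", "FIM", "DKK", "DEM", "ITL", "KGS", "EEK", "KYD",
         "BWP", "LUF", "FRF", "GBP", "QAR", "RUR", "TOP", "MAD", "SAR"},
        1.0, 100.0};
    const rateProfile service2{
        "service2props", {"SEK", "FIM", "DKK", "MAD", "GBP"}, 60.0, 50.0};
    const rateProfile frontEnd{
        "frontEndReqs", {"SEK", "DKK", "GBP"}, 10.0, 75.0};
    return findService({service1, service2}, frontEnd);
}
```

Under these assumed rules the first service is rejected on both update frequency (1 is below 10) and cost (100 exceeds 75), while the second passes all three checks.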


4.4 Discussion 


A QoS-based trader facilitates the server-selection 
process in open systems where clients are not built to 
work with one particular server. In an open system, 
clients must be prepared to select from a range of differ- 
ent services that provide the same functionality at dif- 
ferent levels of QoS. The QRR fabric makes it relatively 
straightforward to implement a QoS-based trader. In 
contrast, it is non-trivial to implement a similar func- 
tionality using a conventional trader with name-value 
pairs. 

The QoS-based trader that we presented here does 
not solve the whole issue of establishing QoS agreements 
between clients and services. We ignored the issue of 
agreement duration. Furthermore, we only touched on 
the topic of payment for QoS. In the example, we de- 
scribed a very simple way to perform a first-order screen- 
ing based on how much clients are willing to pay. What 
clients actually pay may be less than this upper bound, 
and will probably be the result of further negotiation 


frontEndReqs for rateService = profile { 
    require DataQuality contract { 
        currency == { SEK, DKK, GBP }; 
        updateFrequency >= 10; 
    }; 
    require Price contract { 
        costPerInvocation <= 75; 
    }; 
} 


FIG. 22. Front-end profile 





between the client and the server, given that the upper 
bound is satisfied. Finally, the trader does not address 
the issue that services may provide different levels of QoS 
depending on the dynamic environment. For example, 
a service may be able to provide higher levels of QoS in 
an environment with plentiful resources and few clients. 
We believe that these issues can be addressed as exten- 
sions to the simple QoS-based trader that we described. 
The issues will not change the basic functionality of the 
trader. Many of the issues can possibly be addressed as 
separate QoS components that complement the trader. 


5. Related Work 


Generally, interface definition languages, such as 
OMG IDL [14], specify functional properties, but lack 
any notion of QoS. 

TINA ODL [20] is different in that it allows program- 
mers to associate QoS requirements with streams and 
operations. A major difference between TINA ODL and 
our approach is that they syntactically include QoS re- 
quirements within interface definitions. Thus, in TINA 
ODL, one cannot associate different QoS properties with 
different implementations of the same functional inter- 
face. 

Similarly, Becker and Geihs [1] extend CORBA IDL 
with constructs for QoS characterizations. Their ap- 
proach suffers from the same problem as TINA ODL: 
they statically bind QoS characterizations to interface 
definitions. They also allow QoS characteristics to be as- 
sociated only with interfaces, not individual operations. 
In addition, they support only limited domains and do 
not allow enumerations or sets. Finally, they allow in- 
heritance between QoS specifications, but it is unclear 
what constraints they enforce to ensure conformance. 
QoS specifications are exchanged as instantiations of IDL 
types without any particular structure. 

There are a number of languages that support QoS 
specification within a single QoS category. The SDL 
language [11] has been extended to include specification 
of temporal aspects. The RTSynchronizer programming 
construct allows modular specification of real-time prop- 
erties [17]. In [7], a constraint logic formalism is used to 
specify real-time constraints. These languages are all 
tied to one particular QoS category, namely timing. In 
contrast, QML and QRR are general purpose; QoS cat- 
egories are user-defined types in QML, and can be used 
to specify QoS properties within arbitrary categories. 

The specification and implementation of QoS con- 
straints have received a great deal of attention within the 
domain of multimedia systems. In [18], QoS constraints 
are given as separate specifications in the form of enti- 
ties called QoS Synchronizers. A QoS Synchronizer is 
a distinct entity that implements QoS constraints for a 
group of objects. The use of QoS Synchronizers assumes 
that QoS constraints can be implemented by delaying, 
reordering, or deleting the messages sent between objects 
in the group. In contrast to QML, QoS Synchronizers 
not only specify the QoS constraints, they also enforce 
them. The approach in [19] is to develop specifications 
of multimedia systems based on the separation of con- 
tent, view, and quality. The specifications are expressed 
in Z. The specifications are not executable per se, but 
they can be used to derive implementations. In [2], mul- 
timedia QoS constraints are described using a temporal, 
real-time logic, called QTL. The use of a temporal logic 
assumes that QoS constraints can be expressed in terms 
of the relative or absolute timing of events. Campbell [4] 
proposes pre-defined C-language structs that can be in- 
stantiated as QoS specifications for multimedia streams. 
The expressiveness of the specifications is limited by 
the C language, thus there is no support for statistical 
distributions. Campbell does, however, introduce sep- 
arate attributes for capturing statistical guarantees. It 
should be noted that Campbell does not claim to address 
the general specification problem. In fact, he identifies 
the need for more expressive specification mechanisms 
that include statistical characterizations. In contrast to 
QML, the multimedia-specific approaches only address 
QoS within a single domain (multimedia). Moreover, 
these approaches tend to assume stream-based commu- 
nication rather than method invocation. 

Zinky et al. [23, 22] present a general framework, 
called QuO, to implement QoS-enabled distributed ob- 
ject systems. The notion of a connection between a client 
and a server is a fundamental concept in their frame- 
work. A connection is essentially a QoS-aware commu- 
nication channel; the expected and measured QoS behav- 
iors of a connection are characterized through a number 
of QoS regions. A region is a predicate over measurable 
connection quantities, such as latency and throughput. 
Their approach does not seem to enable dynamic creation, 
communication, and manipulation of QoS specifications. 
In particular, it is not clear how to use their approach 
to dynamically establish connections in an open 
environment based on QoS needs and provisions. 

In [21] Zinky and Bakken discuss the problem of man- 
aging meta-information in systems with adaptable QoS. 
The paper discusses various kinds of data that is needed 
for adaptive CORBA systems. They do not, however, 
present any concrete way in which the information can 
be described and communicated. We believe that QML 
and QRR can be used to describe several of the facets, 
such as kind and mechanism, identified in the paper. 

Linnhoff-Popien and Thissen [12] describe methods 
for evaluating the distance from the ideal characteristics 
of a required service to what the available offers provide. 
They use computed distances to select the most appropriate 
service. In contrast, QML and QRR focus on 
statistical characterization of QoS and systematic com- 


82 5th USENIX Conference on Object-Oriented Technologies and Systems (COOTS '99) 


parison by means of conformance. We could extend our 
approach to incorporate a notion of “preference” if many 
services satisfy a client’s requirements. The area of util- 
ity theory is a promising foundation for such an exten- 
sion. 

Within the Object Management Group (OMG) there 
is an ongoing effort to specify what is required to ex- 
tend CORBA [14] to support QoS-enabled applications. 
The current status of the OMG QoS effort is described 
in [16], which presents a set of questions on QoS spec- 
ification and interfaces. We believe that our approach 
provides an effective answer to some of these questions. 
ISO has an ongoing activity aiming at the definition of 
a reference model for QoS in open distributed systems. 
In a recent working paper [8] they outline how various 
dimensions such as delay and reliability could be char- 
acterized. They lack, however, any proposal or recom- 
mendations for representations or languages with which 
such constraints can be expressed and communicated. 


6. Conclusion 


We believe that one of the next major advances of dis- 
tributed object systems is to make them QoS enabled. 
An important step towards QoS enabling is to facilitate 
QoS characterizations of distributed object components. 
We have previously suggested the QML language for this 
purpose [6]. When we have characterizations we need 
to allow these to influence how distributed objects are 
connected and what underlying communication mecha- 
nism and transports they use. Currently, these decisions 
are typically made at design time and hardwired into 
the system. However, to build flexible applications that 
execute in internet-like environments, we need to sup- 
port dynamic connections based on QoS matching. The 
dynamic connections can be facilitated by QoS components, 
such as traders and negotiators. QoS characterizations 
are of little use if we cannot verify that the 
components of a system actually comply with the QoS 
agreements that have been set up among them. This 
can be accomplished by monitoring connections between 
objects. 

Trading, negotiation, monitoring and many other 
functions of QoS-enabled distributed systems require a 
format for exchanging QoS specifications. We have de- 
signed and are implementing a language (QML) and a 
run-time representation (QRR) that is tailored for mak- 
ing distributed object systems QoS aware. One of the 
main advantages of using QRR instead of ad-hoc runtime 
representations is that QRR comes with a precisely de- 
fined notion of conformance. Moreover, the QRR library 
comes with a generic conformance checking function. Al- 
though conformance appears somewhat manageable for 
simple constraints over numeric dimensions, it is 
challenging to define and check conformance for statistical 
aspects and set domains with user-defined ordering. 

QML and QRR allow middle-ware developers to in- 
vent new mechanisms and services for QoS-enabled dis- 
tributed systems. 

OMG has standardized—among other things— 
CORBA IDL and IIOP to facilitate interoperability of 
heterogeneous distributed objects. Analogously, we be- 
lieve that open systems can only meet QoS requirements 
if they can specify and communicate their QoS char- 
acteristics and requirements. QML and QRR could be 
viewed as a first attempt to come up with a common 
specification language and interchange format for 
QoS-enabled distributed object systems. 
Abstract 


While object-relational database servers can be 
extended with user-defined functions (UDFs), the 
security of the server may be compromised by these 
extensions. The use of Java to implement the UDFs is 
promising because it addresses some security concerns. 
However, it still permits interference between different 
users through the uncontrolled consumption of 
resources. In this paper, we explore the use of a Java 
resource management mechanism (JRes) to monitor 
resource consumption and enforce usage constraints. 
JRes enhances the security of the database server in the 
presence of extensions allowing for (i) detection and 
neutralization of denial-of-service attacks aimed at 
resource monopolization, (ii) monitoring resource 
consumption which enables precise billing of users 
relying on UDFs, and (iii) obtaining feedback that can 
be used for adaptive query optimization. 


The feedback can be utilized either by the UDFs 
themselves or by the database system to dynamically 
modify the query execution plan. Both models have 
been prototyped in the Cornell Predator database 
system. We describe the implementation techniques, 
and present experiments that demonstrate the effects of 
the adaptive behavior facilitated by JRes. We conclude 
that, minimally, a database system supporting 
extensions should have a built-in resource monitoring 
and controlling mechanism. Moreover, in order to fully 
exploit information provided by the resource control 
mechanisms, both the query optimizer and the UDFs 
themselves should have access to this information. 


1. Introduction 


There has been much recent interest in using Java to 
implement database extensions. The SQL-J proposal 
[SQLJ] describes efforts by database vendors to support 
user-defined functions (UDFs) written in Java. Java 
UDFs are considered relevant in environments like 
internets and intranets, where large numbers of users 
extend a database server backend. In earlier work 
[GMS+98], we explored some of the security, 
portability, and efficiency issues that arise with Java 


UDFs. The main observation was that although Java 
UDFs are efficient, they do not solve all the security 
problems that arise when a server accepts untrusted 
extensions. Specifically, short of creating a process per 
UDF, there is no suitable mechanism to prevent one 
UDF from allocating large amounts of memory or using 
a large portion of the CPU time. This allows a 
malicious or buggy UDF to effectively deny service to 
all the other users of the database system. Another 
problem directly and negatively affecting deployment 
of Java-UDF-enabled database systems is the lack of an 
infrastructure for monitoring resource consumption and 
billing users for resources consumed by their UDFs. 


In this paper, we describe the application of a Java 
resource accounting interface, JRes [CvE98], to address 
this issue. JRes has been incorporated into the Cornell 
Predator database system [Sesh98a] as part of the 
Jaguar project, and we base our observations on this 
prototype. To the best of our knowledge the resulting 
system is the first database where extensibility based on 
a safe language is augmented with an ability to monitor 
usage of computational resources (we note that similar 
concurrent efforts are being made by vendors of several 
relational systems). In particular, our work further 
limits the amount of trust that the database server must 
have with respect to the behavior of extensions. Due to 
using a safe language, our previous work ensured that 
the server is protected from extensions and that the 
extensions are protected from one another. At the same 
time, the benefits of executing all participating entities 
in a single address space can be exploited. This paper 
demonstrates how the class of UDFs that may execute in 
a database server without affecting the execution of the 
server or other extensions can be enlarged to contain 
UDFs with unknown and potentially malicious or 
unbalanced resource requirements. 


Furthermore, we question two implicit assumptions 
underlying previous work on optimizing queries with 
user-defined functions: (i) that the costs and completion 
time of invoking a UDF will remain constant over the 
execution of the entire query, and (ii) that it is possible 
to provide realistic estimates on the costs of UDFs. A 
query executing on large tables and using costly UDFs 
will execute long enough that considerable fluctuations 
in resource availability are likely to be observed while 
the query is running. Consequently, the relative weights 
associated with different types of resources will change. 
Expensive UDFs also often execute complex code, 
making it difficult to accurately predict their cost. 
Finally, database cost estimates are typically not 
absolute; rather, they simply need to be accurate relative 
to each other on some cost scale used by the database 
system developers (and usually not quantified in terms 
of real time). The user defining a new UDF has no way 
to position it on this internal cost scale. 


Our work addresses some of these concerns. JRes 
provides feedback for adaptive query optimization by 
monitoring the use of resources by each UDF. 
Depending on the adopted system design, either each 
UDF requests information about resource consumption 
and adapts its runtime behavior accordingly, or the 
database server uses the feedback from the resource 
monitor to adapt the query’s execution. Each model is 
desirable in certain situations, leading to the conclusion 
that a database system needs to support both models of 
resource control feedback. 


The rest of the paper is structured as follows. An 
example-based motivation of our work is contained in 
the next section. This is followed by a description of 
selected details on Jaguar and JRes - systems used for 
experimentation in this study. Section 4 outlines a 
design space of applicability of dynamic resource 
controlling mechanisms for user defined functions. 
Section 5 shows how resource-limiting policies can be 
defined for Java UDFs. Taking advantage of resource 
availability feedback is discussed in Sections 6 and 7. 
This is followed by a discussion of related work and 
finally by conclusions. 


2. Motivation 


In order to justify the need for management of 
computational resources in extensible database servers 
let us consider the following example. An amateur 
investor is planning future stock acquisitions and has 
purchased access to a database server that can be 
extended with user-defined functions coded in Java. 
Among other data, users of the server can access the 
table Companies, which lists firms whose stock is 
currently sold and bought on the New York Stock 
Exchange. The table has two columns of interest for the 
investor: Name (the name of a company) and 
ClosingPrices, which is an array of numbers 
corresponding to the company’s share prices. The array 
contains an entry for every day since the company 
entered the stock market. 


The investor wants to find companies that meet all the 
following requirements: (i) the company has been on the 
market for at least forty days, (ii) the price of a share 
forty days ago is smaller than the price today, and (iii) 
on any given day during the last thirty-nine days the 
price has not changed by more than 2% from the 
previous day. This can be expressed as the following 
SQL query: 

SELECT C.Name 
FROM Companies C 
WHERE LooksPromising(C.ClosingPrices) 


where LooksPromising is a method of an investor- 
supplied Java class StockAnalysis. Such a class can 
be written by the investor, generated by a tool, or 
purchased from a software development house. A 
simple implementation is shown below: 


public class StockAnalysis { 
    private static final int DAYS = 40; 
    private static final double VAR = 0.02; 

    public static boolean LooksPromising(double[] ts) { 
        int size = ts.length; 
        // (i) at least forty days of price history are required 
        if (size < DAYS) return false; 
        // (ii) the price forty days ago must be below today's price 
        if (ts[size - DAYS] >= ts[size - 1]) 
            return false; 
        // (iii) reject if any daily change in the window reaches 2% 
        for (int i = 1; i < DAYS; i++) { 
            double price = ts[size - i]; 
            double prevPrice = ts[size - i - 1]; 
            double v = (price - prevPrice) / prevPrice; 
            if (Math.abs(v) >= VAR) 
                return false; 
        } 
        return true; 
    } 
} 


This kind of database extensibility has many benefits. 
Many complex filters can be coded much more easily and 
more efficiently when using a programming language 
instead of SQL. UDFs can be used to integrate user- 
specific algorithms and external data sources. By 
controlling the use of the network and the file system, 
and by using protection mechanisms of Java, the server 
can ensure that its data is not corrupted or 
compromised. Cryptography-based protocols like 
Secure Socket Layer [SSL97] can be used to guarantee 
secure uploading of UDFs to the server. This means 
that if investors trust the server they can be assured that 
nobody else will see the code of their UDFs, which can 
be a concern when substantial effort was expended 
towards creating them. 


However, at the current state of the art of extensible 
database technologies [GMS+98] several important 
issues are still not addressed. These problems are 
discussed in the subsections below. They include 
dealing with denial-of-service attacks, accounting for 
resources consumed by a user's particular UDFs, and 
supporting system scalability. For extensible databases 
where the UDFs are executed in the controlled 
environment of a safe language, these problems, to a 
large extent, boil down to the ability to monitor 
computational resources such as main memory, CPU 
usage, and network resources. 


2.1. Denial-of-Service Attacks 


The code of LooksPromising is not necessarily well 
behaved. People make mistakes - for instance, a 
programmer could forget to increment i in the for 
loop which can lead to a non-terminating execution of 
LooksPromising for some inputs. Beyond honest 
mistakes, some code is developed with 
malicious purposes in mind. One could omit 
incrementing the loop counter on purpose, or, for 
instance, insert into LooksPromising code to allocate 
an infinite list so that all available main memory is 
monopolized by a single instance of the UDF. 
Regardless of whether such programs are created on 
purpose or unintentionally, they are equally dangerous 
in that they can monopolize vital resources. Except for 
a few trivial cases, it is virtually impossible to decide 
by means of static code analysis if a Java UDF will use 
more resources than a particular limit. Dynamic 
mechanisms that constrain resource usage are needed to 
prevent denial-of-service attacks. Traditional operating 
systems use hardware protection and coarse-grained 
process structure to enforce resource limits. Extensible 
object-relational database environments, in many ways 
subsuming the role of an operating system, need to 
provide the same functionality. 
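

As a rough illustration of a dynamic limit, the following Java sketch bounds the wall-clock time of a single UDF invocation using only the standard java.util.concurrent library. It is not JRes and it does not limit CPU time or memory; the class name TimeLimitedUdf and the fallback parameter are assumptions made for this example. Cancellation relies on thread interruption, so a UDF that never checks its interrupt status keeps running in the background even after the caller has given up on it.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: runs a boolean UDF under a wall-clock time limit. */
final class TimeLimitedUdf {
    // Daemon threads, so that a stuck UDF cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "udf-worker");
        t.setDaemon(true);
        return t;
    });

    /**
     * Returns the UDF's result, or fallback if the UDF fails or does not
     * finish within limitMillis milliseconds.
     */
    static boolean run(Callable<Boolean> udf, long limitMillis, boolean fallback) {
        Future<Boolean> pending = POOL.submit(udf);
        try {
            return pending.get(limitMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            pending.cancel(true);   // requests interruption; cooperative, not a hard kill
            return fallback;
        } catch (ExecutionException e) {
            return fallback;        // the UDF itself threw an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            pending.cancel(true);
            return fallback;
        }
    }
}
```

Treating a timed-out predicate as false simply drops the row from the query result; a server could equally abort the whole query at this point.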


2.2. Accounting for Consumed Resources 


Some database servers use accounting mechanisms to 
charge customers for service. The same will likely 
happen to extensible database servers based on Java. 
An immediate problem is that no mechanisms exist to 
enable accounting for resources consumed by Java 
UDFs. For instance, CPU time and heap memory used 
by an invocation of LooksPromising are unknown, 
since Java provides no support for gauging their usage. 


Ideally, one should be able to run a UDF and obtain a 
list of all the resources consumed by it. For instance, in 
the case of LooksPromising, the CPU time and 
maximum amount of memory used during the 
invocation should be available. This information can be 
used for profiling the code and for charging investors 
for resources consumed during the execution of their 
queries. Obtaining resource consumption traces from a 
running UDF is valuable for query optimizers. 
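

For comparison, JVMs released since this work can usually report per-thread CPU time through the standard java.lang.management API. The sketch below meters one invocation that way. It is an illustration only and not the JRes mechanism; the names UdfAccounting, Usage, and meter are hypothetical, and memory accounting is left out.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.function.Supplier;

/** Hypothetical helper that meters the CPU time of one UDF call on the calling thread. */
final class UdfAccounting {
    /** Result of a metered call; cpuNanos is -1 if the JVM cannot measure CPU time. */
    static final class Usage<T> {
        final T result;
        final long cpuNanos;

        Usage(T result, long cpuNanos) {
            this.result = result;
            this.cpuNanos = cpuNanos;
        }
    }

    static <T> Usage<T> meter(Supplier<T> udf) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        boolean supported = threads.isCurrentThreadCpuTimeSupported();
        // getCurrentThreadCpuTime() itself returns -1 when measurement is disabled.
        long before = supported ? threads.getCurrentThreadCpuTime() : -1;
        T result = udf.get();
        long after = supported ? threads.getCurrentThreadCpuTime() : -1;
        long cpuNanos = (before >= 0 && after >= 0) ? after - before : -1;
        return new Usage<>(result, cpuNanos);
    }
}
```

The returned cpuNanos value could feed a billing record or a cost estimate for the query optimizer.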


2.3. Scheduling and Scalability 




Another problem with deploying extensible database 
servers based on safe languages such as Java is the 
difficulty of managing large numbers of extensions. 
Since virtually no information about resource 
consumption can be obtained, the system does not 
know which UDFs are particularly resource-hungry and 
which resources will be stressed when a large number 
of copies of a particular UDF are executing 
simultaneously. This potentially leads to unbalanced 
resource consumption patterns. For instance, let us 
imagine several thousand CPU-intensive UDF copies 
running (or attempting to run) at the same time. If the 
UDFs do not adapt their behavior, they face the 
prospect of slow execution, of deadlock, of being 
stopped temporarily, or even of being killed by the 
system, depending on the local policy. This is likely to 
result in wasted resources since queries and/or UDFs 
will be aborted halfway through. Providing dynamic 
information about resources available to UDFs allows 
database systems to implement admission control 
policies that minimize the number of aborted UDFs. 
The UDFs themselves may be coded in a smart way to 
adapt to changing resource demand and supply. 
However, in order to be able to perform such coding, an 
infrastructure and an interface that allows the UDFs to 
learn about the loads during their execution must be 
provided. 
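

One simple form of admission control can be sketched in Java with a counting semaphore. The class AdmissionController and its method names are hypothetical, and the sketch caps only the number of concurrently running UDFs; a real policy would presumably take measured CPU and memory availability into account.

```java
import java.util.concurrent.Semaphore;

/** Hypothetical admission controller: caps the number of concurrently running UDFs. */
final class AdmissionController {
    private final Semaphore slots;

    AdmissionController(int maxConcurrentUdfs) {
        this.slots = new Semaphore(maxConcurrentUdfs);
    }

    /** Reserves a slot if one is free; refusing up front avoids aborting work halfway. */
    boolean tryAdmit() {
        return slots.tryAcquire();
    }

    /** Returns a slot previously obtained with tryAdmit(). */
    void release() {
        slots.release();
    }

    /** Free capacity right now; a UDF could poll this to adapt its own behavior. */
    int freeSlots() {
        return slots.availablePermits();
    }
}
```

A server would call tryAdmit before starting a UDF and release in a finally block once the UDF ends, whether normally or not.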


2.4. An Approach to Manage Resources in 
Extensible Database Servers 


The objective of this work is to provide mechanisms for 
selected components of resource management in an 
extensible database where UDFs are executed in a 
single running copy of the Java Virtual Machine. This 
includes (i) accounting for resource (CPU time, heap 
memory, network) usage on a per-UDF basis, (ii) 
setting limits on resources available to particular UDFs, 
and (iii) providing the ability to define a specific action 
to be taken when a resource limit is exceeded. To this 
end we have extended Java and consequently the JVM 
serving as an extensibility mechanism with a resource 
accounting interface, called JRes. The extension does 
not require any changes to the underlying JVM and 
relies on dynamic bytecode rewriting and a small native 
component, coded in the C language. As will be 
demonstrated later in the paper, most of the problems 
discussed in this section are addressed in our prototype. 


3. Selected Details on Jaguar and JRes 
Environments 


This section contains a brief description of features of 
Jaguar and JRes relevant for the work presented in this 
paper. Both systems have been described in detail 
elsewhere [GMS+98, CvE98]. 


3.1. Jaguar 


The Jaguar project extends the Cornell Predator object- 
relational database system [Sesh98a] with portable 
query execution. The goals of the project are two-fold: 
(a) to migrate client-side query processing into the 
database server for reasons of efficiency, (b) to migrate 
server-side query processing to any component of the 
channel between the server and the ultimate end-user. 
In short, the project aims to eliminate the artificial 
server-client boundaries with respect to query 
execution. The motivation of the project is the next- 
generation of database applications that will be 
deployed over the Web. In such applications, a large 
number of physically distributed end-users working on 
diverse and mutually independent applications interact 
with the database server. In this context, portable query 
execution can translate into greater options for efficient 
evaluation and consequently reduced user response 
times. 


The Predator database server is written in C++, and 
permits new extensions (new data types and UDFs, also 
written in C++). To explore goal (a) of the Jaguar 
project, the database server has been enhanced with the 
ability to define UDFs with Java. This provides clients 
with a portable mechanism with which to specify client- 
side operations and migrate them to the server. Java 
seems to be a good choice as a portable language for 
UDFs, because Java bytecode can be run with security 
restrictions within the Java Virtual Machine. 


In the current implementation, Java functions are 
invoked from within the server using either Sun’s Java 
Native Interface or Microsoft’s Raw Native Interface. 
The first step is to initialize the Java Virtual Machine 
(JVM) as a C++ object. Any classes that need to be 
used are loaded into the JVM using a custom interface. 
When methods of the classes need to be executed, they 
are invoked through JNI or RNI, depending on which 
vendor’s JVM is currently used. Parameters that need to 
be passed to Java UDFs must be first mapped to Java 
objects. 


The creation of the JVM is a heavyweight operation. 
Consequently, a single JVM is created when the 
database server starts up and is used until server 
shutdown. Each Java UDF is packaged as a method 
within its own class. If a query involves a Java UDF, 
the corresponding class is loaded once for the whole 
query execution. The translation of data (arguments and 
results) requires the use of further interfaces of the 
JVM. Callbacks from the Java UDF to the server occur 
through the “native method” feature of Java. There are 
a number of details associated with the implementation 
of support for Java UDFs. Importantly, security 
mechanisms can prevent UDFs from performing 
unauthorized functions. 


3.2. JRes 


Through JRes, the trusted core of Java-based extensible 
databases can (i) be informed of all new thread 
creations, (ii) state an upper limit on memory used by 
all live objects allocated by a particular thread or thread 
group, (iii) limit how many bytes of data a thread can 
send and receive, (iv) limit how much CPU time a 
thread can consume, and (v) register overuse callbacks, 
that is, actions to be executed whenever any of the 
limits is exceeded. Both trusted core and untrusted 
extensions can learn about resource limits and resource 
usage. 


JRes consists of two Java interfaces, one exception, and 
the class ResourceManager. The class defines 
constants identifying resources and exports several 
methods. The methods can be divided up into two 
categories: privileged and general access. The 
privileged, authenticated methods can be used only by 
the execution environment (server, browser). Setting 
and clearing resource limits, setting and invoking 
overuse callbacks all fall into this category. In the 
context of this work, this ensures that only the database 
server itself has privileged access to the resource 
management subsystem. UDFs are prevented from 
interfering with the resource management policies of a 
given system. 


The general access JRes methods, available to all 
entities in the system, allow for querying the resource 
management subsystem about resource usage of a 
particular thread or thread group, about resource limits 
imposed on a thread or a thread group, and about 
system-wide resource availability. 


Overuse callbacks can be coded as arbitrary Java code. 
Consequently, what they can do is limited by the 
control mechanisms that are part of the JVM. For 
instance, it is possible to lower a thread’s priority but it 
is impossible to change the thread-scheduling 
algorithm. Another limitation is the inability to track 
memory allocated in the native code. This is due to the 
fact that most of JRes is implemented through bytecode 
rewriting. 
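

The kind of action an overuse callback can take may be sketched as follows. The interface CpuOveruseHandler and the class LowerPriorityThenInterrupt are hypothetical and do not reproduce the JRes interfaces or their signatures. The sketch also interrupts the offending thread as a last resort rather than stopping it outright, since forcibly stopping threads is unsafe and unsupported on current JDKs.

```java
/** Hypothetical callback shape; the real JRes interfaces are not reproduced here. */
interface CpuOveruseHandler {
    void cpuLimitExceeded(Thread offender);
}

/** Demotes the offending thread step by step and interrupts it once it cannot go lower. */
final class LowerPriorityThenInterrupt implements CpuOveruseHandler {
    @Override
    public void cpuLimitExceeded(Thread offender) {
        int priority = offender.getPriority();
        if (priority > Thread.MIN_PRIORITY) {
            offender.setPriority(priority - 1);  // first response: lower the priority
        } else {
            offender.interrupt();                // last resort: ask the thread to stop
        }
    }
}
```

Because the handler is ordinary Java code, it can only use controls that the JVM exposes, which is the limitation noted above.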


The design and operation of our current prototype, 
which combines Jaguar and JRes, is shown in Figure 1. 
In this example setup, two remote clients submit their 
queries through a Web interface. The UDF code (i.e., 
Java classes) is loaded by the Jaguar class loader. The 
subsequent execution is controlled by what the standard 
Java Security Manager and the JRes Resource Manager 
allow. 
Figure 1. The design and operation of Jaguar extended with Resource Manager. 


4. Design Space 


Let us take a look at possible dimensions along which a 
resource monitoring facility can be taken advantage of 
in an extensible object-relational database system. The 
first dimension roughly quantifies the UDF 
programmer’s involvement in monitoring the resources. 
One end of the spectrum is populated by UDFs that 
monitor their own resource consumption and the 
resource limits to adjust their execution patterns with 
respect to changing resource availability. A UDF that 
dynamically adapts the accuracy of the produced results 
to the availability of resources forms an example. The 
other end of the spectrum consists of systems that 
monitor the resources available to extensions and apply 
this information to change execution of queries 
containing UDFs. A database server dynamically 
reordering conjunctive predicates depending on their 
resource usage would be placed here. 


The other, orthogonal dimension is the domain of 
application of knowledge about both system-wide and 
per-UDF resource consumption. One such domain is 
security - detection of malicious UDFs and preventing 
denial of service attacks. Another domain is 
optimization, where combining knowledge of resource 
demands and their availability may lead to improved 
execution times of UDFs. 


Functionality: Optimization 
    UDFs’ involvement: UDFs that query the database server 
    environment in order to adjust their execution and 
    improve performance. 
    Server’s involvement: Database servers that optimize query 
    execution through utilizing resource consumption 
    information. 

Functionality: Security 
    UDFs’ involvement: UDFs that query the database server 
    environment in order to avoid using more resources than 
    allowed. 
    Server’s involvement: Database servers that monitor 
    resource consumption of UDFs to detect malicious behavior. 

(Horizontal axis: responsibility for using resource information.) 

Figure 2. Dimensions of applicability of resource monitoring. 


A system may occupy more than one quadrant in the 
outlined space. Using information concerning resource 
utilization and availability for optimization does not 
preclude its usage for enhancing system security. 
Similarly, both the UDFs and an object-relational
database itself can independently take advantage of 
JRes feedback at the same time. Figure 2 summarizes 
the classification introduced above and gives examples 
belonging to each of the groups. 


5. Enhanced Database Security using JRes 


As stated earlier, protection provided by a safe
language is only one component of the necessary
security infrastructure provided by extensible
environments. Another vital part, neglected so far in
available designs, is the ability to control resources
available to extensions and the subsequent ability to
detect and neutralize malicious or otherwise resource-
unstable UDFs. Since the class of database servers
discussed in this paper falls into the extensible
environments category, it is crucial for an unimpeded
development and deployment of this data access
technology to pay attention to resource monitoring
issues.


Figure 3 shows one possible policy that limits each
UDF to one thread only. Moreover, such a thread is
limited to no more than 50kB of memory and less than
10 milliseconds of CPU time out of every 100
milliseconds. The limits are set whenever a thread
creation is detected by JRes. Whenever the memory
limit is exceeded, an appropriate exception is thrown. In
addition to signalling a problem, this effectively
prevents the operation of object creation from
completion. Exceeding the time limit results in
lowering the offending thread’s priority; if the priority
cannot be lowered any more, the thread is stopped. It
must be pointed out that stopping threads should be
dealt with carefully, since threads may own state or
other resources, like open files, which may need to be
saved or cleaned up appropriately before killing the
thread. The underlined code in Figure 3 is a part of JRes
(either as defined methods or interface methods to be
defined). The code of several methods is not shown.
The details of JRes and its interface have been
presented in [CvE98]; the goal of Figure 3 is to
demonstrate how resource controlling policies for Java
UDFs can be defined.


class ExtensibleDBServerRMP
    implements ThreadRegistrationCallback, OveruseCallback {

    private Object cookie;

    private ExtensibleDBServerRMP(Object cookie)
    { this.cookie = cookie; }

    public static synchronized void initialize() {
        Object cookie = new Object();
        ResourceManager.initialize(cookie);
        ExtensibleDBServerRMP rmp =
            new ExtensibleDBServerRMP(cookie);
        ResourceManager.setThreadRegistrationCallback(cookie, rmp);
    }

    public void threadRegistrationNotification(Thread t) {
        if (t.getThreadGroup().getName().equals("system"))
        { return; }
        if (udfHasThreadsAlready(t))
        { stopThread(t); }
        ResourceManager.setLimits(cookie, RESOURCE_CPU, t, 10, 100, this);
        ResourceManager.setLimits(cookie, RESOURCE_MEM, t, 50, 0, this);
    }

    public void resourceUseExceeded(int resType, Thread t, long value) {
        if (resType == RESOURCE_CPU) {
            int priority = t.getPriority();
            if (priority == Thread.MIN_PRIORITY)
            { stopThread(t); }
            else { t.setPriority(priority - 1); }
        } else if (resType == RESOURCE_MEM) {
            throw new JResResourceExceededException("memory");
        }
    }
}


Figure 3. An example resource controlling policy for user defined functions.


6. Design of Resource Control Feedback 
for Java UDFs 


The JRes interface allows for retrieving information 
about current system-wide resource availability and 
about per-UDF consumption. This information can be 
used in several ways to improve either overall system 
performance or the performance of “smart” UDFs. In 
this section, we describe several scenarios that show
usage and applicability of the Jaguar/JRes resource 
monitoring. In the next section, we demonstrate the 
performance impact when JRes is used in this fashion. 


6.1. Obtaining UDF Costs as a Function of 
Input Arguments 


[Sesh98b] explores optimizations on the boundary 
between relational query execution and the execution of 
UDFs and method extensions. The paper identifies four 
categories of optimization opportunities and studies
techniques applicable to each of the categories. An 
important category requires knowledge of the resource 
consumption of the UDFs. Our work provides a 
practical framework in which resource utilization 
information can be obtained and used for improving 
query plans. For instance, let us consider the following
query:


SELECT C.Name
FROM Companies C
WHERE LooksPromising(C.ClosingPrices) = true
  AND ExternalRating(C.Name) > 0.9
  AND Profitability(C.Name) = "Top"


The three UDF predicates are “black boxes” from the 
viewpoint of both the underlying database and the 
module managing the extensibility. In order to generate 
the optimal plan, the query optimizer must know the 
selectivity and cost of each predicate involved. Thus, an 
off-line or on-line gathering of performance and 
selectivity data is necessary in order to provide the 
query optimizer with the required information. In the 
example above, some predicates may access the 
network (for instance, ExternalRating may have to 
communicate with other databases), some may be very 
CPU-intensive, and others may use large quantities of 
memory. 


Applying JRes off-line to generate a table associating
input sizes with execution time, bytes sent and received,
and the maximum amount of memory used is simple.
However, such a table makes sense only if the input
size determines the resource consumption. The process
of generating such tables may sometimes uncover that
there is simply no correlation between the argument
size and the resources consumed by the UDF.


6.2. Dynamic Predicate Reordering Based 
on Resource Consumption 


It is often not possible to execute a query off-line - for 
instance when it has been submitted by a user during an 
interactive session with a database server. In such 
settings, Jaguar augmented with JRes is used to gather 
dynamic resource profiles. The information can then be 
fed dynamically to the execution engine, which may 
change the order of predicate execution based on 
similar criteria as in the static case. 


Dynamic resource monitoring has one advantage over 
static monitoring - relative values of resources are 
known, so localized adjustments can be performed 
better. Let us assume that in the example query from 
the previous subsection Profitability (very CPU- 
intensive) is applied after the equally selective 
ExternalRating (which consumes large quantities of 
network bandwidth). The order of predicates will 
change during the same query execution whenever the 
system detects that due to the presence of other queries 
and UDFs in the system there is currently contention 
for the network while a relatively large amount of CPU 
time is available. The predicates with high costs, in 
terms of currently scarce resources, are executed later, 
benefiting from the selectivity of earlier predicates. 
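

The reordering rule described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class PredicateReordering, its Predicate profile, the scarcity weights, and the sample numbers are all hypothetical and are not part of Jaguar or JRes. The ordering key, rank = (selectivity - 1) / cost, is a commonly used heuristic for ordering expensive predicates and is an assumption here; the text itself only requires that predicates that are expensive in currently scarce resources run later.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative sketch only: orders UDF predicates by scarcity-weighted cost. */
final class PredicateReordering {

    /** Hypothetical per-invocation resource profile of one UDF predicate. */
    static final class Predicate {
        final String name;
        final double cpuMillis;     // CPU time per invocation
        final double networkBytes;  // bytes transferred per invocation
        final double selectivity;   // fraction of tuples for which the predicate is true

        Predicate(String name, double cpuMillis, double networkBytes, double selectivity) {
            this.name = name;
            this.cpuMillis = cpuMillis;
            this.networkBytes = networkBytes;
            this.selectivity = selectivity;
        }
    }

    /** Cost of one invocation; a larger weight means the resource is currently scarcer. */
    static double weightedCost(Predicate p, double cpuWeight, double networkWeight) {
        return cpuWeight * p.cpuMillis + networkWeight * p.networkBytes;
    }

    /** Returns predicate names by ascending rank; assumes every weighted cost is positive. */
    static List<String> order(List<Predicate> predicates,
                              double cpuWeight, double networkWeight) {
        List<Predicate> sorted = new ArrayList<>(predicates);
        sorted.sort(Comparator.comparingDouble(
                (Predicate p) -> (p.selectivity - 1.0)
                        / weightedCost(p, cpuWeight, networkWeight)));
        List<String> names = new ArrayList<>();
        for (Predicate p : sorted) {
            names.add(p.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Predicate> preds = new ArrayList<>();
        preds.add(new Predicate("Profitability", 10.0, 0.0, 0.3));     // CPU-intensive
        preds.add(new Predicate("ExternalRating", 0.5, 2000.0, 0.3));  // network-intensive
        // Network contended: the network-heavy predicate moves to the end.
        System.out.println(order(preds, 1.0, 0.1));    // [Profitability, ExternalRating]
        // Network cheap relative to CPU: the order flips.
        System.out.println(order(preds, 1.0, 0.001));  // [ExternalRating, Profitability]
    }
}
```

In a running system the two weights would be recomputed from the monitored resource availability, and the order would be re-derived whenever they change noticeably.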


6.3. Dealing with Resource Shortages 
without Reduced Quality 


As described in detail in [Pang94], queries executing in 
a priority scheduling environment face the prospect of 
continually having resources taken away and then given 
back during their lifetime. The same statement is true 
for UDFs as well, especially for those invoked in 
queries with long lifetimes; typically, this category 
would include UDFs operating on large data inputs. Let 
us take a look at UDFs for which the quality of a result 
may not suffer but the completion time may worsen. 
For instance, let us consider a query that invokes a UDF 
in order to determine whether one image contains 
another: 


SELECT P.Name
FROM Paintings P, Cats C
WHERE Contains(P.Image, C.Image) = true


The images are stored in a compressed format and 
Contains has to decompress them in order to run a 
pattern-matching algorithm. If memory is scarce, only 
parts of images may be decompressed. This will make 
the pattern-matching operation more time intensive 
while the results will be the same, and, more
importantly, invocations of Contains will not be 
prematurely aborted because of lack of memory. 


6.4. Adjusting Quality of UDF Results 
when Necessary Resources are Scarce 


In some scenarios, adapting to resource scarcity may be 
accomplished by degrading the quality of output. 
Examples include faster image operations resulting in 
worse quality of results that are nevertheless useful for 
the end user. Another such example can be seen 
through the eyes of a user of a financial database. Her 
UDFs return approximations of the standard deviation 
of an input time series. The CPU time available to any 
UDF invocation can be limited system-wide in order to 
make quick response times more likely for a large 
population of users. In this setting, the UDF must
complete without using more resources than given -
otherwise, it will be terminated and no result will be
produced. Thus, while there is no bound on the length
of the time series, the time available to the UDF is
bounded. The UDF can query JRes for the CPU time
available to itself. This, in turn, can be used to compute
the number of entries of the input series that can be
processed before using up the quota. If less than the
whole series can be processed, it is up to the UDF to
decide which entries to use: the most plausible choices include
sampling with a fixed step size or using the most recent 
section of the time series. The return value may be less 
precise than whatever could be computed with 
unlimited resources, but is still a much better alternative 
than getting nothing back because the UDF’s execution 
has been aborted. 
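

A minimal Java sketch of this quality adjustment follows. It assumes that the available CPU time and an estimated per-entry processing cost are already known to the UDF; how they are obtained from JRes is deliberately left out, and the class BudgetedStdDev and its method names are hypothetical. The sketch uses the fixed-step sampling option mentioned above.

```java
/** Illustrative sketch only: approximate standard deviation under a CPU-time budget. */
final class BudgetedStdDev {

    /** Number of entries that fit into the budget, capped at the series length. */
    static int affordableEntries(int seriesLength, double availableCpuMillis,
                                 double costPerEntryMillis) {
        long n = (long) Math.floor(availableCpuMillis / costPerEntryMillis);
        return (int) Math.max(0, Math.min(seriesLength, n));
    }

    /**
     * Population standard deviation over every step-th entry, with the step
     * chosen so that at most maxEntries entries are read.
     */
    static double sampledStdDev(double[] series, int maxEntries) {
        if (series.length == 0 || maxEntries <= 0) {
            throw new IllegalArgumentException("nothing can be processed");
        }
        int step = (series.length + maxEntries - 1) / maxEntries;  // ceiling division
        double sum = 0.0;
        double sumOfSquares = 0.0;
        int count = 0;
        for (int i = 0; i < series.length; i += step) {
            sum += series[i];
            sumOfSquares += series[i] * series[i];
            count++;
        }
        double mean = sum / count;
        // Guard against a tiny negative value caused by rounding.
        return Math.sqrt(Math.max(0.0, sumOfSquares / count - mean * mean));
    }

    public static void main(String[] args) {
        double[] series = {2, 4, 4, 4, 5, 5, 7, 9};
        // Hypothetical budget: 2 ms of CPU time at 0.5 ms per entry.
        int n = affordableEntries(series.length, 2.0, 0.5);
        System.out.println(n + " entries, approximation " + sampledStdDev(series, n));
    }
}
```

Using the most recent section of the series instead of fixed-step sampling would only change which indices the loop visits.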


6.5. Exploiting Resource Tradeoffs 


In some scenarios, one resource can be traded off for 
another in order to mask temporary or recurring 
fluctuation in resource availability. One example has 
been presented in Section 6.3. As another example, a
UDF that sends data back directly to the
client via a network connection may choose to send
compressed results or to send the data “as is”. In the
first case, more CPU time but less bandwidth is needed; 
the reverse holds in the second scenario. The most 
common form of trading resources off for one another 
is caching, where memory (main or disk) is traded off 
for whatever resources were consumed to generate 
cached data. Let us take a look at the following join, 
where the UDF Similar detects a similarity between 
two time series, retrieved from some other table or from 
a file system: 


SELECT D1.name, D2.name
FROM Data D1, Data D2
WHERE Similar(D1.name, D2.name) > 0.7
A naive way of coding Similar is to retrieve time
series based on the names of arguments, compare 
inputs, and return the value describing the similarity. 
However, since the UDF is invoked repeatedly in this 
query, simple optimizations are possible. If the query is 
executed using a nested-loop join algorithm (scanning 
D1, and for each tuple, finding a “matching” tuple in 
D2), the UDF will be invoked several times with the 
same first argument. The UDF code could choose to 
cache the first argument, thereby using memory to 
reduce CPU and I/O time. 
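

The caching idea can be sketched as follows. This is a hypothetical illustration, not the code used in Jaguar: the class CachingSimilar, the loader function standing in for retrieval of a time series by name, and the placeholder similarity measure are all assumptions made for the example.

```java
import java.util.function.Function;

/** Illustrative sketch only: a Similar UDF that caches its most recent first argument. */
final class CachingSimilar {
    private final Function<String, double[]> loader;  // hypothetical series retrieval
    private String cachedName;       // name of the cached first argument, or null
    private double[] cachedSeries;   // its time series
    int loads = 0;                   // number of retrievals, kept for illustration

    CachingSimilar(Function<String, double[]> loader) {
        this.loader = loader;
    }

    private double[] load(String name) {
        loads++;
        return loader.apply(name);
    }

    /** Retrieves the first series only when it differs from the cached one. */
    double similar(String first, String second) {
        if (!first.equals(cachedName)) {
            cachedSeries = load(first);
            cachedName = first;
        }
        return similarity(cachedSeries, load(second));
    }

    /** Placeholder measure (an assumption): 1 / (1 + mean absolute difference). */
    static double similarity(double[] a, double[] b) {
        int n = Math.min(a.length, b.length);
        if (n == 0) {
            return 0.0;
        }
        double diff = 0.0;
        for (int i = 0; i < n; i++) {
            diff += Math.abs(a[i] - b[i]);
        }
        return 1.0 / (1.0 + diff / n);
    }

    public static void main(String[] args) {
        // Toy loader: the "series" is a single value derived from the name.
        CachingSimilar udf = new CachingSimilar(name -> new double[] {name.length()});
        udf.similar("ab", "ab");
        udf.similar("ab", "abcd");
        System.out.println("retrievals: " + udf.loads);  // 3 instead of the naive 4
    }
}
```

Under a nested-loop join the saving grows with the size of the inner table, since the outer argument is retrieved once per outer tuple rather than once per pair.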


7. Performance Study of Run-Time 
Adaptation 


While previous sections discussed possible uses of 
resource monitoring mechanisms, this section focuses 
on quantifying the impact of using JRes in Jaguar. The 
experimental results presented below were obtained on 
a Pentium II 300 MHz computer with 128 MB of RAM,
running Windows NT Workstation 4.0. The Java
Virtual Machine used by Jaguar was Microsoft Visual
J++, v. 1.1. Each of the experiments uses a table T with
1000 distinct tuples, each holding two integers, one of
which is an identifier for a separately stored time series.
The experiments are simple and arguably
not very realistic, but they indicate potential
performance gains resulting from dynamic monitoring
of resource availability and appropriate adaptive
behavior.


To set the stage for our experiments, let us consider
three UDFs: UDF-1, UDF-2 and UDF-3. Each of them 
takes as an argument an integer identifying a certain 
time series and returns a boolean value. The costs of 
these UDFs are considerably larger than the costs of 
simple predicates (e.g. integer comparisons). The first 
two UDFs use caching to internally store results of their 
computations - sorting and computing various statistical 
moments of time series. UDF-3 does not cache its 
results and thus its execution time does not depend on 
the amount of memory available to it. Figure 4 shows 
the average execution time of each of the UDFs as a 
function of the amount of memory available per UDF. 
There were a hundred distinct time series involved; all 
of them fit into an 800kB cache. Two points are worth 
stressing here. First, the results of Figure 4 can be easily 
obtained and can then be used by a static query 
optimizer. Second, UDF-1 and UDF-2 are examples of 
user defined functions that utilize a possible resource 
tradeoff. In this particular case, the tradeoff was 
increased consumption of main memory (caching) 
versus reduced need for CPU time. 
Figure 4. Execution time of three UDFs as a function of available memory.


7.1. Dynamic Predicate Reordering 


The three UDFs were coded so that they have the same 
selectivity (on the average, each of them returns true
for 30% of its inputs) and in fact always return the same
answer if given the same input argument (i.e. whenever
UDF1 is true, so are UDF2 and UDF3, and vice versa).
Consider the following query: 


SELECT T.Timeseries
FROM T
WHERE UDF1(T.Timeseries)
  AND UDF2(T.Timeseries)
  AND UDF3(T.Timeseries)


The execution time depends on the order in which the
predicates are applied and on the amount of memory
available to the UDFs. Every nontrivial predicate is
associated with a certain cost and a certain selectivity.
The latter determines the average ratio of tuples on
which the predicate results in true. Selective and cheap
predicates should be applied before less selective and
more expensive predicates to reduce the overall
execution cost. We picked three different evaluation
orders: 1-2-3, 2-1-3, and 3-1-2¹, and compared their
costs with the cost of a dynamically adapted order. We
varied the available cache size, changing the relative
costs of the predicates and thus their optimal order.
Figure 5 shows the average per-tuple processing time
for each of the three given evaluation orders and for an
adaptive strategy. The latter monitored available
memory and applied this information to dynamically
optimize the evaluation order. Incurring a small
overhead for the dynamic plan modification, the
adaptive strategy always chooses the best order for the
predicates.


¹ Because all three UDFs return the same values on identical
arguments, it only matters which predicate is evaluated first:
if it returns false, the latter two are not evaluated; if it
returns true, all other predicates are evaluated as well. The
three picked permutations are equivalent in their complexity
to 1-3-2, 2-3-1, and 3-2-1, respectively.
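

The cost comparison behind this experiment can be written down as a small worked example in Java. Since all three UDFs agree on every input, the first predicate is always evaluated and the other two are evaluated only when it returns true, so the expected per-tuple cost is the cost of the first predicate plus the selectivity times the sum of the remaining costs. The class EvaluationOrderCost and the per-invocation costs used below are hypothetical; they are not the measured values behind Figures 4 and 5.

```java
/** Illustrative sketch only: expected per-tuple cost when all predicates agree. */
final class EvaluationOrderCost {

    /**
     * The predicate at index first always runs; the remaining ones run only
     * for the fraction of tuples (selectivity) on which the first one is true.
     */
    static double expectedCost(double[] costs, int first, double selectivity) {
        double rest = 0.0;
        for (int i = 0; i < costs.length; i++) {
            if (i != first) {
                rest += costs[i];
            }
        }
        return costs[first] + selectivity * rest;
    }

    /** Index of the predicate that should be evaluated first. */
    static int bestFirst(double[] costs, double selectivity) {
        int best = 0;
        for (int i = 1; i < costs.length; i++) {
            if (expectedCost(costs, i, selectivity) < expectedCost(costs, best, selectivity)) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical per-invocation costs (ms) of UDF-1, UDF-2, UDF-3 at some cache size.
        double[] costs = {4.0, 6.0, 5.0};
        for (int first = 0; first < costs.length; first++) {
            System.out.println("UDF-" + (first + 1) + " first: "
                    + expectedCost(costs, first, 0.3) + " ms per tuple");
        }
        System.out.println("best first predicate: UDF-" + (bestFirst(costs, 0.3) + 1));
    }
}
```

Under these assumptions the cheapest predicate should always go first whenever the selectivity is below 1, which is why an adaptive strategy only needs to track which UDF is currently cheapest as the available cache size changes.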


If the three UDFs were coded as one large UDF 
invoking the three tests by itself, the reordering could
be done inside the UDF. In the case of predicates 
applied to the same input, it is possible (with a bit of 
additional work) to re-code them as a single predicate. 


7.2. Reordering Join and Selection 
Operations 


Let us now consider the following query, operating on a 
table T (with 1000 tuples, each of them consisting of 
two integers; the first one serves as a reference to a 
stored time series) and a table S (containing 10000 
tuples, each of them also consisting of two integers): 


SELECT *
FROM T, S
WHERE T.a = S.a AND UDF1(T.a)


Due to the equality predicate used in the join between T
and S, the join has a certain selectivity with respect to
the table T. The application of UDF-1 can take place
either before or after the join, changing the cost of the
overall query execution. Applying UDF-1 before the
join results in an invocation of UDF-1 on each tuple of
T, but reduces the number of tuples of T that have to be
joined. On the other hand, applying the join first
requires fewer invocations of UDF-1, but more tuples are
joined. The total cost of the query is different in both
cases. Our prototype can change the plan dynamically,
during query execution. Figure 6 shows how the two
static strategies perform under changing memory
availability and contrasts them with the performance of the
dynamically adapted plan. The adaptation - applying
selection before or after the join - is done similarly to
the previous experiment: the resource monitoring
information is used by Jaguar to change the plan while
it is executing. As Figure 6 demonstrates, the
performance gains can be quite substantial when
memory availability changes frequently. As in the
previous experiment, with a small overhead the
adaptive strategy follows the best, hybrid plan. Let us
note that in this particular experiment, unlike in the
previous one, the query plan reordering can only be the
responsibility of the query execution module -- it
cannot be taken over by an adaptive UDF.


Figure 5. Execution time of different predicate ordering strategies.


Figure 6. Execution time of different plans. The vertical axis
shows query execution time (ms); the plotted series are select
before join, join before select, and the adaptive plan.


7.3. Overheads Introduced by JRes


The benefits of on-line resource monitoring come at a
price of runtime overheads. For the UDFs used in our
experiments, the added execution time overheads are
within 3-6%. The overheads are directly proportional to
the number of objects allocated by UDFs and in some
cases can be substantial [CvE98]. The overheads may
be reduced if JRes is integrated into the JVM. Still,
increased system security and the ability to adapt both
execution plans and UDF execution have to be
weighed against the increased execution time.


8. Related Work 


Past work related to our research falls into two broad
categories: (i) predicting and controlling resource
consumption in existing database systems, and (ii)
resource accounting and enforcing resource limits in
traditional and extensible operating systems and
programming languages (a much more detailed
discussion of this area can be found in [CvE98]). In this
section, we summarize the most important work from
these areas influencing our research.


8.1. Database Systems 


Several database systems and standards allow the 
implementation of functions in C, C++ or Java, either
as predicates or as general functions. The examples
include POSTGRES [SR86], Starburst [HCL+90], Iris
[WLH90], and several commercially available systems
- for instance Informix, DB2, and Oracle 8. The issue of
expensive predicate optimization was first raised in the
context of POSTGRES [Sto91] and a practically
applicable theory addressing the issue was developed in
[HS93]. The goal of a recent work of Hellerstein and
Naughton [HN97] is to optimize the execution of
queries with expensive predicates by caching their
arguments and results. The resulting technique, Hybrid 
Caching, is promising in the presence of repeated 
invocations of a predicate on the same arguments. 


Obtaining realistic estimates of the costs of user defined 
methods is difficult and quite often imprecise [Hel95]. 
Typically, it is assumed that, along with estimating
selectivity, the creator or user of a UDF will provide a
cost estimate as well. Assuming that cost estimates are 
correct and remain constant throughout the entire 
execution of the query, it is possible to efficiently 
generate an optimal plan over the desired execution 
space [CS96]. 


Another line of research refines query optimization by 
focusing on join reordering where an important 
working assumption is that predicates are zero-cost 
[IK84, KBZ86, SI92]. A general formulation of query
optimization for various buffer sizes can be found in
[INS+92]. This runtime parameter is typically unknown
before the actual query execution. By constructing
various plans in advance, the most appropriate one can
be chosen at run-time just before the query is executed,
when the available buffer size is known. Another
technique helping with estimation of the query size is
adaptive sampling [LNS90], where statistical methods
are used to predict the result size based on selective
runs of the estimated query. Completing joins and sorts
under fluctuating availability of main memory has been
the subject of recent research by [Pang94].


Dynamic query optimization was incorporated into a 
commercially available Rdb/VMS system [Ant93]. The 
research suggests that it is cost-effective to run several
local plans simultaneously with proportional speed for a
short time, and then select the “best” plan to be run for
a long time. An optimization model that assigns the 
bulk of the optimization effort to compile-time and 
delays carefully selected optimization decisions until 
runtime is described in [CG94]. Dynamic plans are 
constructed at compile-time and the best one is selected 
at runtime, when cost calculations and comparisons can 
be performed. The approach guarantees plan optimality. 
However, none of these approaches deals with 
unknown and changing costs of user defined functions. 


Our work differs from the research mentioned above in
our focus on UDFs and on monitoring the environment 
in which UDFs execute. In addition to providing the 
ability to run queries off-line to get estimates of their 
cost, our system constantly monitors resource 
utilization. This information is available directly both to 
the UDFs themselves and the query execution module. 
Both the database system and UDFs can utilize this 
knowledge directly and dynamically. 


8.2. Operating Systems and Programming 
Languages 


Enforcing resource limits has long been a responsibility 
of operating systems. For instance, many UNIX shells 
export the limit command, which sets resource 
limitations for the current shell and its child processes. 
Among others, available CPU time and maximum sizes 
of data segment, stack segment, and virtual memory can 
be set. Enforcing resource limits in traditional operating 
systems is coarse-grained in that the unit of control is 
an entire process. The enforcement relies on kernel-
controlled process scheduling and hardware support for
detecting memory overuse. 


The architecture of the SPIN extensible operating 
system allows applications to safely change the 
operating system’s interface and implementation 
[BSP+95]. SPIN and its extensions are written in 
Modula-3 and rely on a certifying compiler to guarantee
the safety of extensions. The CPU consumption of
untrusted extensions can be limited by introducing a
time-out. Another example of an extensible operating
system concerned with constraining resources
consumed by extensions is the VINO kernel [SES+96].
VINO uses software fault isolation as its safety
mechanism and a lightweight transaction system to
cope with resource hoarding. Timeouts are associated
with time-constrained resources. If an extension holds
such a resource for too long, it is terminated. The 
transactional support is used to restore the system to a 
consistent state after aborting an extension. 


Except for the ability to manipulate thread priorities and 
invoke garbage collection, Java programmers are not 
given any interface to control resource usage of 
programs. Several extensions to Java attempt to 
alleviate this problem, but none of them share the goals 
of JRes. For instance, the Java Web Server [JWS97] 
provides an administrator interface that displays 
resource usage in a coarse-grained manner, e.g. the 
number of running threads; however, the information 
about memory or CPU used by each individual thread is 
not accessible. PERC (a real-time implementation of 
Java) [Nils96] provides an API for obtaining 
guaranteed execution time and assuring resource 
availability. While the goal of real-time systems is to 
ensure that applications obtain at least as many
resources as necessary, the goal of JRes is to ensure that 
programs do not exceed their resource limits. 


9. Conclusions 


The security and functionality of an extensible database 
server can be enhanced by providing resource- 
controlling mechanisms in the language used for 
creating user-defined functions. Because of the 
combination of portability, security, and object-
orientation, Java emerges as a premier language for
creating extensible environments. Our work evaluates,
in the context of extensible database servers, a resource
controlling interface we have developed for general
purpose Java programs. To the best of our knowledge,
no database system supporting UDFs (or, for that
matter, no other extensible server system that does not
rely on hardware protection) currently provides the
functionality we have added to Jaguar. The presented
description of the system design, the evaluation of
resource monitoring, and the provision of mechanisms
for adaptive behavior are important steps towards 
practical extensible servers. 


In particular, our work further limits the amount of trust 
that the database server must have with respect to the 
behavior of extensions. The standard JVM controls 
access of UDFs to security-sensitive resources such as 
files and network. This paper demonstrates that the class
of UDFs that may execute in a database server without
affecting the execution of the server or other extensions
has been enlarged to contain UDFs with unknown and
potentially malicious or unbalanced resource
requirements. Moreover, the paper shows that the
execution cost of a UDF may depend on the dynamic
supply of computational resources. Thus, changing
the query plan dynamically, during query execution, is
necessary to achieve optimal performance. Jaguar
extended with JRes provides appropriate mechanisms 
for achieving this goal. 


Even though this work is carried out in the context of 
an extensible object-relational database and Java 
extensions, the conclusions generalize to any system 
where Java code dynamically extends an execution 
environment, like a Web browser or an extensible Web 
server. The security can be enhanced and performance 
concerns can be addressed in such environments in a 
similar way to our prototype implementation analyzed 
in the paper. 
Abstract


Texas is a highly portable, high-performance persistent 
object store that can be used with conventional compil- 
ers and operating systems, without the need for a prepro- 
cessor or special operating system privileges. Texas uses 
pointer swizzling at page fault time as its primary ad- 
dress translation mechanism, translating addresses from 
a persistent format into conventional virtual addresses 
for an entire page at a time as it is loaded into memory. 


Existing classifications of persistent systems typically 
focus only on address translation taxonomies based on 
semantics that we consider to be confusing and ambigu- 
ous. Instead, we contend that the granularity choices 
for design issues are much more important because they 
facilitate classification of different systems in an unam- 
biguous manner unlike the taxonomies based only on ad- 
dress translation. We have identified five primary design 
issues that we believe are relevant in this context. We 
describe these design issues in detail and present a new 
general classification for persistence based on the gran- 
ularity choices for these issues. 


Although the coarse granularity of pointer swizzling at 
page fault time is efficient in most cases, it is sometimes
desirable to use finer-grained techniques. We examine 
different issues related to fine-grained address transla- 
tion mechanisms, and discuss why these are not suitable 
as general-purpose address translation techniques. In- 
stead, we argue for a mixed-granularity approach where 
a coarse-grained mechanism is used as the primary ad- 
dress translation scheme, and a fine-grained approach is 
used for specialized data structures that are less suitable 
for the coarse-grained approach. 


We have incorporated fine-grained address translation in 
Texas using the C++ smart pointer idiom, allowing pro-
grammers to choose the kind of pointer used for any 
data member in a particular class definition. This ap- 
proach maintains the important features of the system: 
persistence that is orthogonal to type, high performance 
with standard compilers and operating systems, suitabil- 
ity for huge shared address spaces across heterogeneous 
platforms, and the ability to optimize away pointer swiz- 
zling costs when the persistent store is smaller than the 
hardware-supported virtual address size. 


1 Introduction 


The Texas Persistent Store provides portable, high- 
performance persistence for C++ [16, 8], using pointer 
swizzling at page fault time [23, 8] to translate addresses 
from persistent format into virtual memory addresses. 
Texas is designed to implement and promote orthogonal 
persistence [1, 2]. Orthogonal persistent systems require 
that any arbitrary object can be made persistent without 
regard to its type; that is, persistence is viewed as the 
storage class¹ of an object rather than as a property of its
type. In other words, persistence is a property of individ- 
ual objects, not of their classes or types, and any object 
can be made persistent regardless of its type. In contrast, 
class-based persistent systems require that any type or 
class that may be instantiated to create persistent objects 
must inherit from a top-level abstract “persistence” class, 
which defines the interface for saving and restoring data 
from a persistent object store. 


Texas uses pointer swizzling at page fault time as the 
primary address translation technique. When a page is 
brought into memory, all pointers in the page are iden- 
tified and translated (or swizzled) into raw virtual ad- 


'A storage class describes how an object is stored. For example, 
the storage class of an automatic variable in C or C++ corresponds to 
the stack because the object is typically allocated on the data stack, 
and its lifetime is bounded by the scope in which it was allocated. 
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dresses. If the corresponding referents are not already 
in memory, virtual address space is reserved for them 
(using normal virtual memory protections), allowing for 
the address translation to be completed successfully. As 
the application dereferences pointers into non-resident 
pages, these are intercepted (using virtual memory ac- 
cess protection violations) and the data is loaded from 
the persistent store, causing further pointer swizzling 
and (potential) address space reservation for references 
to other non-resident data. Since running programs only 
see pointers in their normal hardware-supported format, 
conventionally-compiled code can execute at full speed 
without any special pointer format checks. 
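
The page-wise translation step described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not code from Texas: the names `PersistentAddr`, `AddressSpace`, and `swizzle_page` are hypothetical, virtual addresses are plain integers handed out by a bump allocator, and the access protection and fault handling that a real implementation relies on are not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// All names below are hypothetical; this is a simulation, not Texas code.
constexpr std::uintptr_t kPageSize = 4096;  // assumed page size

// A pointer in persistent (long) format: page number plus offset in page.
struct PersistentAddr {
    std::uint32_t page;
    std::uint32_t offset;
};

// Page-wise mapping from persistent pages to (simulated) virtual pages.
class AddressSpace {
public:
    // Returns the virtual base address for a persistent page, reserving a
    // fresh (conceptually access-protected) virtual page on first sight.
    std::uintptr_t reserve(std::uint32_t persistent_page) {
        auto it = base_.find(persistent_page);
        if (it != base_.end()) return it->second;
        std::uintptr_t base = next_base_;
        next_base_ += kPageSize;
        base_.emplace(persistent_page, base);
        return base;
    }
    std::size_t reserved_pages() const { return base_.size(); }

private:
    std::unordered_map<std::uint32_t, std::uintptr_t> base_;
    std::uintptr_t next_base_ = 0x10000;  // arbitrary simulated start address
};

// Swizzles every pointer slot of one faulted-on page in a single pass, so
// the application never sees a pointer in persistent format.
std::vector<std::uintptr_t> swizzle_page(
        const std::vector<PersistentAddr>& pointer_slots, AddressSpace& as) {
    std::vector<std::uintptr_t> swizzled;
    swizzled.reserve(pointer_slots.size());
    for (const PersistentAddr& p : pointer_slots) {
        assert(p.offset < kPageSize);
        swizzled.push_back(as.reserve(p.page) + p.offset);
    }
    return swizzled;
}
```

Note that two pointers into the same persistent page share one reservation, which is why the mapping table in this sketch grows with the number of pages rather than with the number of pointers.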


This page-wise address translation scheme has several 
advantages. One is that it exploits spatial locality of ref- 
erence, allowing a single virtual memory protection vio- 
lation to trigger the translation of all persistent addresses 
in a page. Another is that off-the-shelf compilers can 
be used, exploiting virtual memory protections and trap 
handling features available to normal user processes un- 
der most modern operating systems. 


However, as with any other scheme that exploits locality 
of reference, it is possible for some programs to exhibit 
access patterns that are unfavorable to a coarse-grained 
scheme; for example, sparse access to large indexing 
structures may unnecessarily reserve more address space with 
page-wise address translation than with more conventional 
pointer-at-a-time strategies. It is desirable to get 
the best of both worlds by combining coarse-grained and 
fine-grained address translation in a single system. 


In Texas, we currently support a fine-grained address 
translation strategy by using smart pointers [17, 7, 12] 
that can replace normal pointers where necessary. Such 
pointers are ignored by the usual swizzling mechanism 
when a page is loaded into memory; instead, each 
pointer is individually translated as it is dereferenced us- 
ing overloaded operator implementations. The mixed- 
granularity approach works well, as shown by experi- 
mental results gathered using the OO1 benchmark [4, 5]. 
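
To make the smart-pointer idea concrete, the following is a minimal sketch, assuming a simple table of resident objects keyed by persistent identifier. The names `ObjectTable`, `PersistentPtr`, and `Part` are hypothetical and are not taken from the Texas sources; a real implementation would fault the object in from the persistent store rather than assert residency.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-in for a persistent store: maps persistent IDs to
// resident objects and counts how often a translation is performed.
template <typename T>
class ObjectTable {
public:
    void add(std::uint64_t pid, T* obj) { resident_[pid] = obj; }

    T* translate(std::uint64_t pid) {
        ++translations_;
        auto it = resident_.find(pid);
        assert(it != resident_.end());  // a real system would load the object
        return it->second;
    }

    int translations() const { return translations_; }

private:
    std::unordered_map<std::uint64_t, T*> resident_;
    int translations_ = 0;
};

// A smart pointer that stays in persistent format and is translated on each
// dereference through the overloaded operators.
template <typename T>
class PersistentPtr {
public:
    PersistentPtr(std::uint64_t pid, ObjectTable<T>& table)
        : pid_(pid), table_(&table) {}

    T& operator*() const { return *table_->translate(pid_); }
    T* operator->() const { return table_->translate(pid_); }

private:
    std::uint64_t pid_;
    ObjectTable<T>* table_;
};

// Example payload type (hypothetical).
struct Part {
    int id;
};
```

In this sketch every use of `ptr->id` or `(*ptr).id` pays for one table lookup, which is the per-dereference cost that the page-wise mechanism avoids for ordinary pointers.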


The remainder of this paper is structured as follows. 
In Section 2, we describe existing well-known address 
translation taxonomies put forth by other researchers, 
and motivate the need for a general classification of per- 
sistence presented in Section 3. In Section 4, we discuss 
issues about fine-grained address translation techniques, 
and why we believe that a pure fine-grained approach is 
not suitable for general use. We describe the implemen- 
tation of mixed-granularity address translation in Texas 
in Section 5 and the corresponding performance results 
in Section 6, before wrapping up in Section 7. 


2 Address Translation Taxonomies 


Persistence has been an active research area for over 
a decade and several taxonomies for pointer swizzling 
techniques have been proposed [13, 9, 11, 19]. In this 
section, we describe important details about each of 
these taxonomies and highlight various similarities and 
differences among them. We also use this as a basis to 
provide motivation for a general classification of persis- 
tent systems based on granularity issues, which we de- 
scribe in Section 3. 


2.1 Eager vs. Lazy Swizzling 


Moss [13] describes one of the first studies of different 
address translation approaches, and the associated termi- 
nology developed for classifying these techniques. The 
primary classification is in terms of “eager” and “lazy” 
swizzling based on when the address translation is per- 
formed. Typically, eager swizzling schemes swizzle an 
entire collection of objects together, where the size of 
the collection is somehow bounded. That is, the need 
to check pointer formats, and the associated overhead, is 
avoided by performing aggressive swizzling. In contrast, 
lazy swizzling schemes follow an incremental approach 
by using dynamic checks for unswizzled objects. There 
is no predetermined or bounded collection of objects that 
must be swizzled together. Instead, the execution dy- 
namically locates and swizzles new objects depending 
on the access patterns of applications. 


Other researchers [9, 11] have also used classifications 
along similar lines in their own studies. However, we 
consider this classification to be ambiguous and confus- 
ing for general use. It does not clearly identify the funda- 
mental issue—the granularity of address translation— 
that is important in this context. For example, consider 
pointer swizzling at page fault time using this classifi- 
cation. By definition, we swizzle all pointers in a vir- 
tual memory page as it is loaded into memory and an 
application is never allowed to “see” any untranslated 
pointers. There is no need to explicitly check the format 
of a pointer before using it, making pointer swizzling 
at page fault time an eager swizzling scheme. On the 
other hand, the basic approach is incremental in nature; 
swizzling is performed one page at a time and only on 
demand, making it a lazy swizzling scheme as per the 
original definition. 


In general, a scheme that is “lazy” at one granularity is 
likely to be “eager” at another granularity. For example, 
a page-wise swizzling mechanism is lazy at the granularity 
of pages because it only swizzles one page at a time, 
but eager at the granularity of objects because it swizzles 
multiple objects—an entire page’s worth—at one time. 
As such, we contend that the granularity at which ad- 
dress translation is performed is the fundamental issue. 


2.2 Node-Marking vs. Edge-Marking Schemes 


Moss also describes another classification based on the 
strategy used for distinguishing between resident and 
non-resident data in the incremental approach. The per- 
sistent heap and various data structures are viewed as 
a directed graph, where data objects represent nodes 
and pointers between objects represent edges that con- 
nect the nodes. The address translation mechanisms are 
then classified as either node-marking or edge-marking 
schemes. 


Figure 1 shows the basic structure for node-marking and 
edge-marking schemes. As the name suggests, edge- 
marking schemes mark the graph edges—the pointers 
between objects—to indicate whether they have been 
translated into local format and reference resident ob- 
jects. In contrast, node-marking schemes guarantee that 
all references in resident objects are always translated, 
and the graph nodes themselves are marked to indicate 
whether they are non-resident. In other words, edges 
are guaranteed to be valid local references but the actual 
referents may be non-resident. Note that the marking ap- 
plies only to non-resident entities, that is, either to nodes 
that are non-resident or to (untranslated) edges that ref- 
erence non-resident nodes. 


Figure 2 shows a classic implementation of a node- 
marking scheme; non-resident nodes are “marked” as 
such by using proxy objects, that is, pseudo-objects that 
stand in for non-resident persistent objects and contain 
their corresponding persistent identifiers. When an ob- 
ject is loaded from the database, all references contained 
in that object must be swizzled as per the definition of 
node-marking—pointers to resident objects are swizzled 
normally while pointers to non-resident objects are swiz- 
zled into references to proxy objects. When the appli- 
cation follows a reference to a proxy object, the system 
loads the referent (F in the figure) from the database and 
updates the proxy object to reference the newly-resident 
object (Figure 2b). Alternatively, the proxy object may 
be bypassed by overwriting the (old) reference to it with 
a pointer to the newly-resident object; if there are no 
other references to it, the proxy object may (eventually) 
be reclaimed by the system. Note, however, that the 
compiled code must still check for the presence of proxy 
objects on every pointer dereference because of the pos- 
sibility that any pointer may reference a proxy object. 
This adds continual checking overhead, even when all 
pointers directly reference data objects without interven- 
ing proxy objects. 
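
The proxy-based scheme and its continual checking cost can be sketched as follows. This is an illustrative model only: the `Cell` layout, the `Heap` class, and the in-memory `database` map are assumptions made for the example, and the sketch uses the bypass variant in which the reference is overwritten with a pointer to the newly-resident object.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <vector>

// Hypothetical object layout: every heap cell carries a flag saying whether
// it is a real object or a proxy standing in for a non-resident one.
struct Cell {
    bool is_proxy = false;
    std::uint64_t pid = 0;   // persistent identifier (meaningful for proxies)
    Cell* target = nullptr;  // set once the proxy's referent is resident
    int value = 0;           // payload of a real object
};

class Heap {
public:
    // Simulated database contents: persistent ID -> payload.
    std::unordered_map<std::uint64_t, int> database;
    int checks = 0;  // number of proxy checks executed
    int loads = 0;   // number of objects loaded from the database

    Cell* make_proxy(std::uint64_t pid) {
        cells_.push_back(std::make_unique<Cell>());
        cells_.back()->is_proxy = true;
        cells_.back()->pid = pid;
        return cells_.back().get();
    }

    // Every dereference pays for the check, even when `ref` already points
    // at a real object. On a proxy hit, the referent is loaded if needed and
    // the reference is overwritten so that it bypasses the proxy next time.
    Cell& deref(Cell*& ref) {
        ++checks;
        if (ref->is_proxy) {
            if (ref->target == nullptr) {
                cells_.push_back(std::make_unique<Cell>());
                cells_.back()->value = database.at(ref->pid);
                ref->target = cells_.back().get();
                ++loads;
            }
            ref = ref->target;
        }
        return *ref;
    }

private:
    std::vector<std::unique_ptr<Cell>> cells_;
};
```

The second dereference in a sequence finds a real object, yet `checks` still increases; that counter stands for the overhead that remains even when no proxy is involved.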


Pointer swizzling at page fault time is essentially a node- 
marking scheme, because swizzled pointers always cor- 
respond to valid virtual memory addresses, while the 
referents are distinguished on the basis of residency. 
However, it differs in an important way from the normal 
approach: unlike the classic implementation, there 
are no explicit proxy objects for non-resident data in pointer 
swizzling at page fault time. Instead, access-protected 
virtual address space pages act as proxy objects.² As 
the application progresses and more data is loaded into 
memory, the pages that were previously protected are 
now unprotected because they contain valid data. The 
major advantage of this approach is that there is no need 
to reclaim proxy objects (because none exist); conse- 
quently, there are no further indirections that must be 
dealt with by compiled code, avoiding continual format 
checks that would otherwise be necessary. 


2.3 General Classification for Persistence 


We have seen that existing classifications focus only on 
address translation techniques. While address transla- 
tion is an important issue, it constitutes only one of sev- 
eral design issues that must be considered when imple- 
menting persistence. We have identified a set of design 
issues that we believe are fundamental to efficient im- 
plementation of any persistence mechanism. We believe 
that a specific combination of these issues can be used 
to characterize any particular implementation. In effect, 
we are proposing a general classification scheme based 
on granularities of fundamental design aspects. 


A classification based on “eager” and “lazy” swizzling is 
ambiguous, because it does not attack the problem at the 
right level of abstraction. The real issue in the distinction 
between lazy and eager swizzling is the size of the unit 
of storage for which address translation is performed. 
This can range from as small as a single reference (as 
in Moss’s “pure lazy swizzling” approach) to a virtual 
memory page (as in pointer swizzling at page fault time), 
or even as large as an entire database (as in Moss’s “pure 
eager swizzling” approach).³ 


² In fact, unmapped virtual address space pages can also serve the 
same purpose. 

³ While crude, this is actually not uncommon. Traditionally, Lisp 
and Smalltalk systems have supported the saving and restoring of 
entire heap images in a “big inhale” relocation. 


Figure 1: Node-marking and edge-marking schemes 

Figure 2: Node-marking scheme using proxy objects 


We believe that it is preferable to consider address translation 
(and other design issues) from the perspective of 
a granularity choice rather than an ad hoc classification 
based on confusing translation semantics. In fact, 
the ambiguity arises primarily because the classifications 
either do not clearly identify the granularity, or, 
because they unnecessarily adhere to a single predetermined 
granularity. Discussing all design issues in terms 
of granularity choices provides a uniform framework for 
identifying the consequence of each design issue on the 
performance and flexibility of the resulting persistence 
mechanism. This is preferable to ambiguous classifications 
such as eager and lazy swizzling because many 
schemes are both “eager” and “lazy” at different scales, 
along several dimensions. 


3 Granularity Choices for Persistence 


We have identified a set of five design issues (including 
address translation) that are relevant to the implementa- 
tion of a persistence mechanism. Each of these issues 
can be resolved by making a specific granularity choice 
that is independent of the choice for any other issue. The 
combination of granularity choices for the different is- 
sues can then be used to characterize persistent systems. 
The specific design issues that we describe in this sec- 
tion are the granularities of address translation, address 
mapping, data fetching, data caching and checkpointing. 
In the remainder of this section, we define and discuss 
each issue in detail⁴ and also present the rationale 
behind the granularity choices for these issues in our im- 
plementation of orthogonal persistence in Texas. 


⁴ Note that while we describe each issue individually, these granularity 
choices are strongly related. It is possible (and quite likely) that 
a system may make the same granularity choice on multiple issues for 
various reasons. 


To a first approximation, the basic unit for all granularity 
choices in Texas is a virtual memory page, because 
pointer swizzling at page fault time relies heavily on vir- 
tual memory facilities, especially to trigger data transfer 
and address translation. The choice of a virtual mem- 
ory page as the basic granularity unit allows us to exploit 
conventional virtual memories, and avoid expensive run- 
time software checks in compiled code, taking advan- 
tage of user-level memory protection facilities of most 
modern operating systems. Sometimes, however, it is 
necessary to change the granularity choice for a partic- 
ular issue to accommodate the special needs of unusual 
situations. It is possible to address these issues at a dif- 
ferent granularity in a way that integrates gracefully into 
the general framework of Texas. 


3.1 Address Translation 


The granularity of address translation is the smallest 
unit of storage within which all pointers are translated 
from persistent (long) format to virtual memory (short) 
format. In general, the spectrum of possible values can 
range from a single pointer to an entire page or more. 


The granularity of address translation in Texas is typi- 
cally a virtual memory page, for coarse-grained trans- 
lation implemented via pointer swizzling at page fault 
time. The use of virtual memory pages has several ad- 
vantages in terms of overall efficiency because we use 
virtual memory hardware to check residency of the refer- 
ents. In addition, we also rely on the application’s spatial 
locality of reference to amortize the costs of protection 
faults and swizzling entire pages. 


As described in Section 5, it is possible to implement a 
fine-grained address translation mechanism for special 
situations where the coarse-grained approaches are un- 
suitable, because of poor locality of reference in the ap- 
plication. Since Texas allows fine-grained translation on 
individual pointers, the granularity of address translation 
in those cases would be a single pointer. 


3.2 Address Mapping 


A related choice is the granularity of address mapping, 
which is defined as the smallest unit of addressed data 
(from the persistent store) that can be mapped indepen- 
dently to an area of the virtual address space. 


To a first approximation, this is a virtual memory page in 
Texas because any page of persistent data can be mapped 
into any arbitrary page of the virtual address space of 
a process. A major benefit of page-wise mapping is 
the savings in table sizes; we only need to maintain ta- 
bles that contain mappings from persistent to virtual ad- 
dresses and vice versa on a page-wise basis, rather than 
(much larger) tables for recording the locations of in- 
dividual objects. This reduces both the space and time 
costs of maintaining the address translation information. 


However, the granularity of address mapping is bigger 
than a page in the case of large (multi-page) objects. 
When a pointer to (or into) a large object is swizzled, 
virtual address space must be reserved for all pages that 
the large object overlaps. This reservation of multiple 
pages is necessary to ensure that normal indexing and 
pointer arithmetic works as expected within objects that 
cross page boundaries. The granularity of address map- 
ping is then equivalent to the number of pages occupied 
by the large object. 
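
The number of pages that must be reserved for a large object follows directly from its size and its starting offset within the first page. The helper below is a small worked sketch; the function name `pages_spanned` and the 4096-byte page size are assumptions chosen for the example.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kPageSize = 4096;  // assumed page size

// Number of consecutive virtual pages that must be reserved for an object
// of `size` bytes that starts `offset` bytes into its first page.
std::size_t pages_spanned(std::size_t offset, std::size_t size) {
    assert(offset < kPageSize && size > 0);
    // Index (relative to the first page) of the page holding the last byte,
    // plus one for the first page itself.
    return (offset + size - 1) / kPageSize + 1;
}
```

For instance, a 200-byte object that starts 4000 bytes into a page crosses one page boundary, so two pages must be reserved contiguously for pointer arithmetic within the object to work.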


3.3 Data Fetching 


As the name suggests, the granularity of data fetching is 
the smallest unit of storage that is loaded from the per- 
sistent store into virtual memory. As with the two gran- 
ularities presented above, we use a virtual memory page 
for this purpose in the current implementation of Texas. 
The primary motivation for making this choice was sim- 
plicity and ease of implementation, and the fact that this 
correlated well with the default granularity choices for 
other design issues in our implementation. 


It is possible to change the granularity of fetching with- 
out affecting any other granularity choices. In essence, 
we can implement our own prefetching to preload data 
from the persistent store. This may actually be desir- 
able for some applications when using raw unbuffered 
I/O instead of normal file I/O [8]. Depending on the 
access characteristics of the application and the dataset 
size, the overall I/O costs can be reduced by prefetching 
several (consecutive) pages instead of a single faulted- 
on page. In general, the granularity of data fetching is 
intimately tied to the I/O strategy that is selected in the 
implementation. 


3.4 Data Caching 


The granularity of data caching is defined as the small- 
est unit of storage that is cached in virtual memory. For 
Texas, the granularity of caching is a single virtual memory 
page, because Texas relies exclusively on the virtual 
memory system for caching persistent data. 


5th USENIX Conference on Object-Oriented Technologies and Systems (COOTS '99) 


A persistent page is usually cached in a virtual memory 
page as far as Texas is concerned. The virtual mem- 
ory system determines whether the page actually resides 
in RAM (i.e., physical memory) or on disk (i.e., swap 
space) without any intervention from Texas. This is 
quite different from some other persistent storage sys- 
tems which directly manage physical memory and con- 
trol the mapping of persistent data into main memory. 
In general, Texas moves data between a persistent store 
and the virtual memory without regard to the distinction 
between virtual pages in RAM and on disk; that is, vir- 
tual memory caching is left up to the underlying virtual 
memory system, which does its job in the normal way. 


It is, of course, possible to change this behavior such that 
Texas directly manages physical memory. However, we 
believe that this is unnecessary, and may even be unde- 
sirable, for most applications. The fact that Texas be- 
haves like any normal application with respect to virtual 
memory replacement may be advantageous for most pur- 
poses because it prevents any particular application from 
monopolizing system resources (RAM in this case). As 
such, applications using Texas are just normal programs, 
requiring no special privileges or resources; they “play 
well with others” rather than locking up large amounts 
of RAM as many database and persistent systems do. 


3.5 Checkpointing 


Finally, we consider the granularity of checkpointing, 
which is defined as the smallest unit of storage that is 
written to non-volatile media for the purpose of sav- 
ing recovery information to protect against failures and 
crashes. 


Texas uses virtual memory protections to detect pages 
that are modified by the application between check- 
points. Therefore, the default unit of checkpointing in 
the usual case is a virtual memory page. Texas em- 
ploys a simple write-ahead logging scheme to support 
checkpointing and recovery—at checkpoint time, mod- 
ified pages are written to a log on stable storage before 
the actual database is updated [16]. 
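
The ordering constraint of this page-granularity checkpoint can be sketched with a toy in-memory model. Everything here is hypothetical (the `Store` structure, integer page numbers, and the explicit `write` call that stands in for the protection fault taken on the first write to a page); the sketch only shows that modified pages reach the log before the database is touched.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <utility>
#include <vector>

using Page = std::vector<unsigned char>;

// Hypothetical in-memory model of a store with page-granularity checkpoints.
struct Store {
    std::map<int, Page> memory;             // pages as seen by the application
    std::map<int, Page> database;           // stable copy of each page
    std::vector<std::pair<int, Page>> log;  // write-ahead log
    std::set<int> dirty;                    // pages modified since last checkpoint

    // Stands in for the protection fault taken on the first write to a page.
    void write(int page, std::size_t offset, unsigned char byte) {
        Page& p = memory.at(page);
        assert(offset < p.size());
        p[offset] = byte;
        dirty.insert(page);
    }

    void checkpoint() {
        // Step 1: log every modified page before touching the database.
        for (int page : dirty) log.emplace_back(page, memory.at(page));
        // Step 2: only now update the database itself.
        for (int page : dirty) database[page] = memory.at(page);
        dirty.clear();
    }
};
```

Only the pages in `dirty` are written, so an untouched page costs nothing at checkpoint time in this model.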


The granularity of checkpointing can be refined by the 
use of sub-page logging. The approach relies on a page 
“diffing” technique that we originally proposed in [16]. 
The basic idea is to save clean versions of pages before 
they are modified by the application; the original (clean) 
and modified (dirty) versions of a page can then be compared 
to detect the exact sub-page areas that are actually 
updated by the application and only those “diffs” 
are logged to stable storage. This technique can be used 
to reduce the amount of I/O at checkpoint time, subject 
to the application’s locality characteristics. The granularity 
of checkpointing in this case is equivalent to the 
size of the “diffs” which are saved to stable storage.⁵ 
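
A byte-wise version of the page “diffing” step might look like the sketch below. The `Diff` record and the `diff_page` function are hypothetical names, and comparing at byte granularity is an assumption; an actual implementation could just as well compare words or merge nearby runs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One modified byte range within a page (hypothetical log record shape).
struct Diff {
    std::size_t offset;
    std::size_t length;
};

// Compares the saved clean copy of a page with its current dirty contents
// and returns the maximal runs of bytes that differ.
std::vector<Diff> diff_page(const std::vector<unsigned char>& clean,
                            const std::vector<unsigned char>& dirty) {
    assert(clean.size() == dirty.size());
    std::vector<Diff> diffs;
    std::size_t i = 0;
    while (i < clean.size()) {
        if (clean[i] == dirty[i]) {
            ++i;
            continue;
        }
        std::size_t start = i;
        while (i < clean.size() && clean[i] != dirty[i]) ++i;
        diffs.push_back({start, i - start});
    }
    return diffs;
}
```

If an application changes three bytes of a 4096-byte page, the log in this sketch receives two short runs instead of the whole page; with poor locality of updates the number of runs grows and the saving shrinks.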


Another enhancement to the checkpointing mechanism 
is to maintain the log in a compressed format. As the 
checkpoint-related data is streamed to disk, we can inter- 
vene to perform some inline compression using special- 
ized algorithms tuned to heap data. Further research has 
been initiated in this area [24] and initial results indicate 
that the I/O cost can be reduced by about a factor of two, 
and that data can be compressed fast enough to double 
the effective disk bandwidth on current machines. As 
CPU speeds continue to increase faster than disk speeds, 
the cost of compression shrinks exponentially relative to 
the cost of disk I/O. Further reduction in costs is also possible 
with improved compression algorithms and adaptive 
techniques. 


4 Fine-grained Address Translation 


There are several factors that motivated us to develop a 
coarse-grained mechanism over a fine-grained approach 
when implementing pointer swizzling at page fault time 
in Texas. The primary motivation is the fact that we 
wanted to exploit existing hardware to avoid expensive 
residency checks in software. However, we believe that 
there are also other factors against using a fine-grained 
approach as the primary address translation mechanism. 
In this section, we discuss fine-grained address trans- 
lation techniques and why we believe that they are not 
practical for high-performance implementations in terms 
of efficiency and complexity. 


Overall, fine-grained address translation techniques are 
likely to incur various hidden costs that have not been 
measured and quantified in previous research. In general, 
we have found that most current fine-grained schemes 
appear to be slower than pointer swizzling at page fault 
time in terms of basic address translation performance. 


⁵ The basic “diffing” technique has been implemented in the context 
of QuickStore [19]; preliminary results are encouraging, although 
more investigation is required. 


4.1 Basic Costs 


Fine-grained address translation techniques usually in- 
cur some inherent costs due to their basic implemen- 
tation strategy. These costs can be divided into the 
usual time and space components, as well as less tan- 
gible components related to implementation complexity. 
We believe that these costs are likely to be on the or- 
der of tens of percent, even in well-engineered systems 
with custom compilers and fine-tuned run-time systems. 
Some of the typical costs incurred in a fine-grained ap- 
proach are as follows: 


• A major component of the total cost can be attributed 
to pointer validity checks. These checks 
can include both swizzling checks and residency 
checks. A swizzling check is used to verify whether 
a reference is translated into valid local format 
or not⁶ while a residency check verifies whether 
the referent is resident and accessible. These two 
checks, while conceptually independent of each 
other, are typically combined in implementations 
of fine-grained schemes. 


• Another important component of the overall cost 
is related to the implementation of a custom object 
replacement policy, which is typically required 
because physical memory is directly managed by 
the persistence mechanism. This cost is usually directly 
proportional to the rate of execution because 
it requires a read barrier.⁷ We discuss this further in 
the next subsection. 


• As resident objects are evicted from memory, a proportional 
cost is usually incurred in invalidating references 
to the evicted objects. This is necessary for 
maintaining referential integrity by avoiding “dan- 
gling pointers.” This cost is directly proportional 
to the rate of eviction and locality characteristics of 
the application. 


• By definition, fine-grained translation techniques 
permit references to be in different formats during 
application execution. This requires that pointers 
be checked to ensure that they are in the right format 
before they can be used, even for simple equality 
checks. It may also be necessary to check transient 
pointers, depending on the underlying implementation 
strategy. As such, there is a continual 
pointer format checking cost that is also dependent 
on the rate of execution and pointer use. 


• Finally, it is possible to incur other costs that exist 
mainly because of unusually constrained object 
and/or pointer representations used by the system. 
For example, accessing an object through 
an indirection via a proxy object is likely to require 
additional instructions.⁸ Another example is 
the increased complexity required for handling language 
features such as interior pointers.⁹ 


⁶ For example, all swizzled pointers in Texas must contain valid 
virtual memory address values. 

⁷ The term read barrier, borrowed from garbage collection research 
[21], is used to denote a trigger that is activated on every read 
operation. A corresponding term, write barrier, is used to denote 
triggers that are activated for every write operation. 


Note that not all of the cost factors described above necessarily 
contribute to the overall performance penalty in 
every fine-grained address translation mechanism. However, 
the basic costs are usually present in some form in 
most systems. 


4.2 Object Replacement 


Fine-grained address translation schemes typically re- 
quire that the persistence mechanism directly manage 
physical memory because persistent data are usually 
loaded into memory on a per-object basis.¹⁰ Therefore, 
it is usually necessary to implement a custom object re- 
placement policy as part of the persistence mechanism. 
This affects not only the overall cost but also the imple- 
mentation complexity. 


A read barrier is typically implemented for every object 
that resides in memory. The usual action for a read bar- 
rier is to set one bit per object for maintaining recency 
information about object references to aid the object re- 
placement policy. The read barrier may be implemented 
in software by preceding each object read with a call to 
the routine that sets the special bit for that object. Com- 
piled code then contains extra instructions—usually in- 
serted by the compiler—to implement the read barrier. 
The read barrier is typically expensive on stock hardware 
because, in the usual case, all read requests must be intercepted 
and recorded. It is known that one in about ten 
instructions is a pointer store (i.e., a write into a pointer) 
in Lisp systems that support compilation. Since read actions 
are more common than write actions, we estimate 
that between 5 and 20 percent of total instructions in an 
application usually correspond to a read from a pointer. 
The exact number obviously varies by application, and 
more importantly, by the source language; for example, 
it is likely to be higher in heap-oriented languages such 
as Java. It may be possible to use data flow analysis 
during compilation such that the read barrier can be optimized 
away for some object references; such analysis 
is, however, hard to implement. 


⁸ Some systems use crude replacement and/or checkpointing policies 
to simplify integration with persistence and garbage collection 
mechanisms. These may incur additional costs due to the choice of 
suboptimal policies. 

⁹ Interior pointers are those that point inside the bodies of objects 
rather than at their heads. 

¹⁰ The data are usually read from the persistent store into a buffer 
(granularity of data fetching) in terms of pages for minimizing I/O 
overhead. However, only the objects required are copied from the 
buffer into memory (granularity of data caching). 
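
A software read barrier of the kind discussed in this subsection can be sketched in a few lines. The `Object` layout, the global counter, and the helper names are assumptions for illustration; in a real system the call to the barrier would be emitted by the compiler rather than written by hand.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical object header with a one-bit recency flag.
struct Object {
    bool referenced = false;  // set by the read barrier, cleared by a sweep
    int field = 0;
};

std::size_t g_barrier_hits = 0;  // counts executed barrier invocations

// The read barrier: extra work that precedes every read through a pointer.
inline void read_barrier(Object* obj) {
    obj->referenced = true;
    ++g_barrier_hits;
}

// What a compiler might emit for the source expression `obj->field`.
inline int read_field(Object* obj) {
    read_barrier(obj);
    return obj->field;
}

// Replacement policy hook: reports whether the object was read since the
// last sweep and resets its bit.
inline bool sweep(Object* obj) {
    bool was_referenced = obj->referenced;
    obj->referenced = false;
    return was_referenced;
}
```

The barrier runs on every read regardless of whether the recency bit is already set, which is why its cost scales with the rate of execution rather than with the number of objects.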


The object replacement policy also interferes with gen- 
eral swizzling, especially if an edge-marking technique 
is being used. In such cases, the object cannot be evicted 
from memory without first invalidating all edges that ref- 
erence it. This obviously requires knowledge about ref- 
erences to the object being evicted. Kemper and Koss- 
man [9] solve this by using a per-object data structure 
known as a Reverse Reference List (RRL) to maintain a 
set of back-pointers to all objects that reference a given 
object. McAuliffe and Solomon [11] use a different data 
structure, called the swizzle table, a fixed-size hash table 
that maintains a list of all swizzled pointers in the sys- 
tem. Both these approaches are generally unfavorable 
because they increase the storage requirements (essen- 
tially doubling the number of pointers at the minimum) 
and the implementation complexity. 
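
The bookkeeping needed to evict an object under edge marking can be illustrated with a simplified reverse reference list. This sketch is not the data structure of [9]: the `Slot` and `Obj` types are hypothetical, and the back-pointers here refer to individual reference slots rather than to whole referencing objects, a simplification made to keep the example short.

```cpp
#include <cassert>
#include <vector>

struct Obj;

// A reference slot that is either swizzled (points at a resident object) or
// unswizzled (holds only the persistent ID). Layout is hypothetical.
struct Slot {
    Obj* swizzled = nullptr;
    unsigned long pid = 0;
};

struct Obj {
    unsigned long pid = 0;
    std::vector<Slot*> rrl;  // back-pointers to every slot that references us
};

// Swizzles a slot and records the back-pointer in the target's list.
void swizzle(Slot& slot, Obj& target) {
    slot.swizzled = &target;
    slot.pid = target.pid;
    target.rrl.push_back(&slot);
}

// Before eviction, every incoming edge is reverted to persistent format;
// the reverse reference list is what makes those edges findable.
void evict(Obj& victim) {
    for (Slot* s : victim.rrl) s->swizzled = nullptr;
    victim.rrl.clear();
}
```

Each swizzled pointer in this sketch is matched by one back-pointer, which illustrates the storage cost noted above.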


4.3 Discussion 


One of the problems in evaluating different fine-grained 
translation mechanisms is the lack of good measure- 
ments of system costs and other related costs in these 
implementations. The few measurements that do ex- 
ist correspond to interpreted systems (except the E sys- 
tem [14, 15]) and usually underestimate the costs for a 
high-performance language implementation. For exam- 
ple, a 30% overhead in a slow (interpreted) implementa- 
tion may be acceptable for that system, but will certainly 
be unacceptable as a 300% overhead when the execution 
speed is improved by a factor of ten using a state-of- 
the-art compiler. 
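
The arithmetic behind this example can be written out explicitly. As an assumption for the sake of illustration, let T be the running time of the interpreted system without persistence checks and c the absolute time spent on those checks, and suppose that c stays constant while the rest of the execution becomes ten times faster:

```latex
\[
\frac{c}{T} = 0.30
\quad\Longrightarrow\quad
\frac{c}{T/10} = 10 \cdot \frac{c}{T} = 3.00 = 300\%
\]
```

The same absolute cost is therefore ten times larger in relative terms once the baseline shrinks by a factor of ten.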


Another cost factor for fine-grained techniques that has 
generally been overlooked is the cost of maintaining 
mapping tables for translating between the persistent and 
transient pointer formats. Since fine-grained schemes 
typically translate one pointer at a time, the mapping ta- 
bles must contain one entry per pointer. This is likely 
to significantly increase the size of the mapping table, 
making it harder to manipulate efficiently. 


We believe that the E system [14, 15] is probably 
the fastest fine-grained scheme that is comparable to a 
coarse-grained address translation scheme; however, it 
still falls short in terms of performance. Based on the 
results presented in [19], E is about 48% slower than 
transient C/C++ for hot traversals of the OO1 database 
benchmark [4, 5].¹¹ This is fairly significant considering 
that the overhead of our system is zero for hot traversals 
and much smaller (less than 5%) otherwise [8]. 


We believe that there are several reasons why it is likely 
to be quite difficult to drastically reduce the overheads 
of fine-grained techniques. Some of these are: 


• Several of the basic costs cannot be changed or reduced 
easily. For example, the pointer validity and 
format checks, which are an integral part of fine- 
grained address translation, cannot be optimized 
away. 


• There is a general performance penalty (maintaining 
and searching large hash tables, etc.) that is typically 
independent of the checking cost itself. As 
mapping tables get larger, it will be more expensive 
to probe and update them, especially because 
locality effects enter the overall picture.¹² 


• Complex data-flow analysis and code generation 
techniques are required to optimize some of the 
costs associated with the read barrier used in the 
implementation. Furthermore, such extra optimiza- 
tions may cause unwanted code bloat. 


• Although the residency property can be treated as a 
type so that Self-style optimizations [6] can be ap- 
plied to eliminate residency checking, it is not easy 
to do so; unlike types, residency may change across 
procedure calls depending on the dynamic run-time 
state of the application. As such, residency check 
elimination is fundamentally a non-local problem 
that depends on complex analysis of control flow 
and data flow. 


Based on these arguments, we believe that fine-grained
translation techniques are comparatively less attractive
for high-performance implementations of persistence
mechanisms.


¹¹ The hot traversals are ideal for this purpose because they
represent operations on data that have already been faulted into memory,
thereby avoiding performance impacts related to differences in
loading patterns, etc.

¹² Hash tables are known to have extremely poor locality because,
by their very nature, they “scatter” related data in different buckets.


Taking the other side of the argument, however, it can
certainly be said that fine-grained mechanisms have their
advantages. A primary one is the potential savings in I/O
because fine-grained schemes can fetch data only as
necessary. There are at least two other benefits over
coarse-grained approaches:


• fine-grained schemes can support reclustering of
  objects within pages, and


• the checks required for fine-grained address
  translation may also be able to support other
  fine-grained features (such as locking, transactions, etc.)
  at little extra cost.


In principle, fine-grained schemes can recluster data 
over short intervals of time compared to coarse-grained 
schemes. However, clustering algorithms are themselves 
an interesting topic for research, and further studies are 
necessary for conclusive proof. We also make another 
observation that fine-grained techniques are attractive 
for unusually-sophisticated systems, e.g., those support- 
ing fine-grained concurrent transactions. Inevitably, this 
will incur an appreciable run-time cost, even if that cost 
is “billed” to multiple desirable features. Such costs may 
be reduced in the future if fine-grained checking is sup- 
ported in hardware. 


5 Mixed-granularity Address Translation 
in Texas 


Pointer swizzling at page fault time usually provides 
good performance for most applications with good lo- 
cality of reference. However, applications that exhibit 
poor locality of reference, especially those with large 
sparsely-accessed index data structures, may not
produce the best results with such coarse-grained translation
mechanisms. Applications that access big multi-way 
index trees are a good example; usually, such applica- 
tions sparsely access the index tree, that is, only a few 
paths are followed down from the root. If the tree nodes 
are large and have a high fanout, the first access to a 
node will cause all those pointers to be swizzled, and 
possibly reserve several pages of virtual address space. 
However, most of this swizzling is probably unnecessary 
since only a few pointers will be dereferenced. 


The solution is to provide a fine-grained address
translation mechanism which translates pointers individually,
instead of doing it a page at a time. Unlike the
coarse-grained mechanism where the swizzling was triggered
by an access-protection violation, the actual translation
of a pointer may be triggered by one of two events:
either when it is “found”¹³ or when it is dereferenced.


There are many ways of implementing a fine-grained 
(pointer-wise) address translation mechanism as we de- 
scribed above. We have selected an implementation 
strategy that remains consistent with our goals of porta- 
bility and compatibility with existing off-the-shelf com- 
pilers, by using the C++ smart pointer abstraction [17, 
7, 12]. Below, we first briefly explain this abstraction 
and then describe how we use it for implementing fine- 
grained translation in Texas. We also discuss how both 
fine-grained and coarse-grained schemes can coexist to 
create a mixed-granularity environment. 


5.1 Smart Pointers 


A smart pointer is a special C++ parameterized class 
such that instances of this class behave like regular 
pointers. Smart pointers support all standard pointer op- 
erations such as dereference, cast, indexing etc. How- 
ever, since they are implemented using a C++ class 
with overloaded operators supporting these pointer op- 
erations, it is possible to execute arbitrary code as part 
of any such operation. While smart pointers were origi- 
nally used in garbage collectors to implement write bar- 
riers [22, 21], they are also suitable for implementing 
address translation; the overloaded pointer dereference 
operations (via the “*” and “->” operators) can imple- 
ment the necessary translation from persistent pointers 
into transient pointers. 


A smart pointer class declaration is typically of the fol- 
lowing form: 


template <class T> class Ptr
{
public:
    Ptr (T *p = NULL);     // constructor
    ~Ptr ();               // destructor
    T& operator * ();      // dereference
    T *operator -> ();     // dereference
    operator T * ();       // cast to "T *"
};


¹³ A pointer is “found” when its location becomes known. This
is similar to the notion of “swizzling upon discovery” as described
in [20].


Given the above declaration of a smart pointer class, we
can then use it as follows:


class Node;              // assume defined
Node *node_p;            // regular pointer
Ptr<Node> node_sp;       // smart pointer

node_p->some_method();
node_sp->some_method();


Note that we have only shown some of the operators in 
the declaration. Also, we avoid describing the private 
data members of the smart pointer because the inter- 
face is much more important than the internal represen- 
tation; it does not matter how the class is structured as 
long as the interface is implemented correctly. In fact, 
as will be clear from our discussion about variations in 
fine-grained address translation mechanisms, the smart 
pointer will need to be implemented differently for dif- 
ferent situations and implementation choices. 


Smart pointers were designed with the goal of trans- 
parently replacing regular pointers (except for declara- 
tions), and providing additional flexibility because arbi- 
trary code can be executed for every pointer operation. 
In essence, it is an attempt to introduce limited (compile- 
time) reflection [10] into C++ for builtin data types (i.e., 
pointers).¹⁴ However, as described in [7], it is impossible
to truly replace the functionality of regular pointers
in a completely transparent fashion. Part of the problem
stems from some of the inconsistencies in the language 
definition and unspecified implementation dependence. 
Thus, we do not advocate smart pointers for arbitrary 
usage across the board, but they are useful in situations 
where further control is required over pointer operations. 
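
As a concrete illustration of executing extra code on each pointer operation, the following is a minimal sketch and not the Texas implementation. The class name CountingPtr, its deref_count() accessor, and the Node type are hypothetical names introduced only for this example; the "arbitrary code" here is simply a validity check and a counter.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical illustration: a smart pointer that runs extra code
// (a validity check and a counter bump) on every dereference.
template <class T>
class CountingPtr {
public:
    explicit CountingPtr(T *p = nullptr) : raw_(p) {}

    T &operator*()  { note_use(); return *raw_; }   // dereference
    T *operator->() { note_use(); return raw_; }    // dereference

    // Cast back to a regular pointer; not counted as a dereference.
    operator T *() const { return raw_; }

    std::size_t deref_count() const { return derefs_; }

private:
    void note_use() {
        assert(raw_ != nullptr);   // hook: validity check
        ++derefs_;                 // hook: bookkeeping
    }

    T *raw_;
    std::size_t derefs_ = 0;
};

// Hypothetical target type used only for the example.
struct Node {
    int value;
    int get() const { return value; }
};
```

Code that uses such a pointer reads like code that uses a regular pointer (for example, sp->get() on a CountingPtr<Node> sp); only the declaration differs, which is the property the text relies on.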


5.2 Fine-grained Address Translation 


In order to implement fine-grained address translation in 
Texas, we must swizzle individual pointers, instead of 
entire pages at a time, thereby reducing the consump- 
tion of virtual address space for sparsely-accessed data 
structures with high fanout. By using smart pointers for 
this purpose, we allow the programmer to easily choose 
data structures that are swizzled on a per-pointer basis, 
without requiring any inherent changes in the implemen- 
tation of the basic swizzling mechanism. 


¹⁴ C++ already provides limited reflective capabilities in the form of
operator overloading for user-defined types and classes. However, this
fails to support completely transparent redefinition of pointer
operations in arbitrary situations.


Note that although the pointers are swizzled individually,
the granularity of data fetching is still a page, not
individual objects, to avoid excessive I/O costs. Below
we describe at least two possible ways to handle
fine-grained address translation, and discuss why we choose
one over the other.


5.2.1 Fine-grained Swizzling 


A straightforward way of implementing fine-grained ad- 
dress translation is to cache the translated address value 
in the pointer field itself; we call this fine-grained swiz- 
zling, because the pointer value is cached after being 
translated.¹⁵ We chose not to follow this approach
because of a few problems with the basic technique.


First, fine-grained swizzling incurs checking overhead 
for every pointer dereference; the first dereference will 
check and swizzle the pointer, while future dereferences 
will check (and find) that the swizzled virtual address 
is already available and can be used directly. A more 
significant problem is presented by equality checks (a 
la the C++ == operator)—when two smart pointers are 
compared, the comparison can only be made after en- 
suring that both pointers are in the same representation, 
that is, either both are persistent addresses or both are 
virtual addresses. In the worst-case scenario, the point- 
ers will be in different representations, and one of them 
will have to be swizzled before the check can complete. 
Thus, a simple equality check, on average, can become 
more expensive than desired. 


One solution is to make the pointer field large enough 
to store both persistent and virtual address values, as in 
E [14, 15]. In the current context, the smart pointer in- 
ternal representation could be extended such that it can 
hold both the pointer fields. This technique avoids the 
overhead on equality checks, which can be implemented 
by simply comparing persistent addresses without regard 
to swizzling, at the expense of additional storage. 


¹⁵ The term “swizzling” implies that the translated address is cached,
as opposed to discarded after use.


Unfortunately, a more serious problem with fine-grained
swizzling is presented by its peculiar interaction with
checkpointing. When a persistent pointer is swizzled,
the virtual address has to be cached in the pointer field
(either E-style or otherwise), that is, we must modify
the pointer. Since virtual memory protections are used
to detect updates initiated by the application for
checkpointing purposes, updating a smart pointer to cache the
swizzled address will generate “false positives” for
updates, causing unnecessary checkpointing. We could
work around this problem by first resetting the
permissions on the page, swizzling (and caching) the pointer,
and then restoring the permissions on the page. However,
this is very slow on average because it requires kernel
intervention to change page protections.
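
The costs discussed in this subsection can be seen in a small sketch of fine-grained swizzling. This is an illustration under stated assumptions, not code from Texas or E: SwizzlingPtr, mapping_table(), and PersistentAddr are hypothetical names, the mapping table is a plain in-memory hash table, and a flag plus two members stand in for what a real implementation would pack into the pointer field.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

using PersistentAddr = std::uint64_t;

// Hypothetical stand-in for the store's mapping table:
// persistent address -> current virtual address.
inline std::unordered_map<PersistentAddr, void *> &mapping_table() {
    static std::unordered_map<PersistentAddr, void *> table;
    return table;
}

template <class T>
class SwizzlingPtr {
public:
    explicit SwizzlingPtr(PersistentAddr pa) : persistent_(pa) {}

    T *operator->() { return resolve(); }
    T &operator*()  { return *resolve(); }

    bool is_swizzled() const { return swizzled_; }

    // Equality must first bring both operands into one representation.
    friend bool operator==(SwizzlingPtr &a, SwizzlingPtr &b) {
        if (a.swizzled_ != b.swizzled_) {   // worst case: mixed forms
            a.resolve();
            b.resolve();
        }
        return a.swizzled_ ? a.virtual_ == b.virtual_
                           : a.persistent_ == b.persistent_;
    }

private:
    T *resolve() {
        if (!swizzled_) {                   // check on every dereference
            auto it = mapping_table().find(persistent_);
            assert(it != mapping_table().end());
            virtual_ = static_cast<T *>(it->second);
            swizzled_ = true;               // this write modifies the pointer
        }
        return virtual_;
    }

    PersistentAddr persistent_;
    T *virtual_ = nullptr;
    bool swizzled_ = false;
};
```

The write inside resolve() is the step that would touch a protected page and so look like an application update to a protection-based checkpointing mechanism.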


5.2.2 Translations at Each Use 


We have seen that a simple fine-grained swizzling mech- 
anism is not as desirable because of its unusual interac- 
tions with the operating system and the virtual memory 
system. However, we can slightly modify the basic tech- 
nique and overcome most of the disadvantages without 
losing any of the benefits. 


The solution is to implement smart pointers that are 
translated on every use and avoid any caching of the 
translated value. In other words, these smart pointers 
hold only the persistent addresses, and must be trans- 
lated every time they are dereferenced because the vir- 
tual addresses are not cached. Equality checks do not 
incur any overhead because the pointer fields are always 
in the same representation and can be compared directly. 


Pointer dereferences also do not incur any additional 
checking overhead. The cost of translating at each use 
does not add much overhead to the overall cost, and is 
usually amortized over other “work” done by the appli- 
cation; that is, the application may dereference a smart 
pointer and then do some computation with the resulting 
target object before dereferencing another smart pointer. 


The advantage of this approach is that the pointer fields 
do not need to be modified because the translated ad- 
dress values are never cached, and all unwanted inter- 
actions with checkpointing and the virtual memory sys- 
tem are avoided. Of course, this approach is still unsuit- 
able as a general swizzling mechanism compared to the 
pointer swizzling at page fault time for reasons described 
in Section 4. 
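
A minimal sketch of translation at each use follows, again as an illustration rather than the Texas code. TranslatingPtr, TranslationTable, and table() are hypothetical names, and the lookup counter exists only to make the per-dereference cost visible.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

using PersistentAddr = std::uint64_t;

// Hypothetical mapping table (persistent address -> virtual address)
// with a counter that records how many translations were performed.
struct TranslationTable {
    std::unordered_map<PersistentAddr, void *> map;
    std::size_t lookups = 0;

    void *translate(PersistentAddr pa) {
        ++lookups;
        auto it = map.find(pa);
        assert(it != map.end());
        return it->second;
    }
};

inline TranslationTable &table() {
    static TranslationTable t;
    return t;
}

template <class T>
class TranslatingPtr {
public:
    explicit TranslatingPtr(PersistentAddr pa) : persistent_(pa) {}

    // Translated on every use; the pointer field itself is never written.
    T *operator->() const {
        return static_cast<T *>(table().translate(persistent_));
    }
    T &operator*() const { return *operator->(); }

    // Both sides always hold persistent addresses: compare directly.
    friend bool operator==(const TranslatingPtr &a, const TranslatingPtr &b) {
        return a.persistent_ == b.persistent_;
    }

private:
    PersistentAddr persistent_;   // only state; no cached virtual address
};
```

Because the dereference operators are const and the object has no mutable state, dereferencing never dirties the page that holds the pointer, and equality needs no table lookup at all.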


5.3 Combining Coarse-grained and Fine- 
grained Address Translation 


It is possible to implement a mixed-granularity address 
translation scheme that consists of both coarse-grained 
pointer swizzling and fine-grained address translation. 
The interaction of swizzling with data structures such 
as B-trees can be handled through the use of the smart 
pointer abstraction. The details of a fine-grained address 
translation scheme are hidden, making the approach par- 
tially reflective. 


We have implemented mixed-granularity address trans- 
lation in Texas by combining a fine-grained approach us- 
ing smart pointers that are translated at every use, along 
with the standard coarse-grained approach. This allows 
better programmer control over the choice of data struc- 
tures for which fine-grained address translation is used, 
while maintaining the overall performance of pointer 
swizzling at page fault time. 


6 Performance Measurements 


We present our experimental results for different ad- 
dress translation granularities using the standard OO1 
database benchmark [4, 5] with some minor variations 
as the workload for our experimental measurements. We 
first briefly explain the rationale for choosing the OO1 
benchmark for our performance measurements, then de- 
scribe the experimental design followed by the actual re- 
sults, and finally end with a summary. 


6.1 Benchmark Choice 


Most performance measurements and analysis of persis- 
tent object systems (and object-oriented database sys- 
tems) have been done using synthetic benchmarks in- 
stead of using real applications. There are two reasons 
for this: first, there are few large, realistic applications 
that exercise all persistence mechanisms of the underly- 
ing system and of those that exist, few are available for 
general use; and second, it is typically extremely hard 
to adapt a large piece of code to any given persistence 
mechanism without having a detailed understanding of 
the application. 


The OO1 and OO7 [3] benchmarks have become quite 
popular among various benchmarks, and have been used 
widely for measuring the performance of persistent sys- 
tems. However, we posit that these benchmarks are 
not representative of typical real-world applications, be- 
cause they have not been validated against applications 
in the domain they represent; other researchers [18] have 
also reached similar conclusions. As such, the experi- 
mental results from these benchmarks should be inter- 
preted with caution. The apparently “empirical” nature 
of these “experimental” results is likely to lull people 
into relying on the results more than appropriate. It is 
important to always remember that while the results are 
obtained empirically, they are ultimately derived from a 
synthetic benchmark and are only as good as the
mapping of benchmark behavior onto real applications.


Although OO1 is a crude benchmark and does not 
strongly correspond to a real application, we use it for 
several reasons. First, OO1 is simple for measuring raw 
performance of pointer traversals (which is what we are 
interested in) and is fairly amenable to modifications for 
different address translation granularities. Use of a syn- 
thetic benchmark is also appropriate in this situation be- 
cause our performance is very good in some cases (i.e., 
zero overhead when there is no faulting) and dependent 
on the rate of faulting (usually minimal overhead com- 
pared to I/O costs) for other cases. As such, crude bench- 
marking is the most practical way to measure perfor- 
mance of different components of our system because 
it is easy to separate our costs from those of the underly- 
ing benchmark; this is usually more difficult with a real 
application. Further discussion on synthetic benchmarks
and their applicability is available in [8].


6.2 Experimental Design 


The benchmark database is made up of a set of part ob- 
jects interconnected to each other. The benchmark spec- 
ifies two database sizes based on the number of parts 
stored in the database—a small database containing 
20,000 parts and a large database containing 200,000 
parts—to allow performance measurements of a sys- 
tem when the entire database is small enough to fit into 
main memory and compare it with situations where the 
database is larger than the available memory. 


The parts are indexed by unique part numbers associated
with each part.¹⁶ Each part is “connected” via a
direct link to exactly three other parts, chosen partially 
randomly to produce some locality of reference. In par- 
ticular, 90% of the connections are to “nearby” 1% of 
parts where “nearness” is defined in terms of part num- 
bers, that is, a given part is considered to be “near” other 
parts if those parts have part numbers that are numeri- 
cally close to the number of this part. The remaining 
10% of the connections are to (uniformly) randomly- 
chosen parts. 
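
The connection rule can be sketched as follows. This is our own illustration, not the benchmark's reference code: the function names are hypothetical, part numbers are assumed to run from 0 to n-1, and the “nearby” 1% is assumed to be a window of n/100 part numbers centred on the source part and clamped to the valid range.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <random>
#include <vector>

// Pick one connection target for part `from` (part numbers 0..n-1).
// Assumption: "nearby" means a window of n/100 consecutive part
// numbers around `from`, clamped so it stays inside the database.
inline int pick_target(int from, int n, std::mt19937 &rng) {
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    if (coin(rng) < 0.9) {                       // 90%: nearby part
        const int window = std::max(1, n / 100);
        const int lo = std::max(0, std::min(from - window / 2, n - window));
        std::uniform_int_distribution<int> near(lo, lo + window - 1);
        return near(rng);
    }
    std::uniform_int_distribution<int> anywhere(0, n - 1);
    return anywhere(rng);                        // 10%: uniformly random
}

// Build the connection table: exactly three outgoing links per part.
inline std::vector<std::array<int, 3>> build_connections(int n, unsigned seed) {
    assert(n > 0);
    std::mt19937 rng(seed);
    std::vector<std::array<int, 3>> links(n);
    for (int part = 0; part < n; ++part)
        for (int &target : links[part])
            target = pick_target(part, n, rng);
    return links;
}
```

For the small database, build_connections(20000, seed) yields 60,000 links, of which roughly nine in ten fall within about 100 part numbers of their source.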


¹⁶ The benchmark specification does not define a data structure that
must be used for the index; we used a B+ tree for all our experiments.


We use the OO1 benchmark traversal operation (perform
a depth-first traversal of all connected parts starting from
a randomly-chosen part and traversing up to seven levels
deep for a total of 3280 parts including possible
duplicates, and invoke an empty procedure on each visited
part) for our performance measurements. Each traversal
set contains a total of 45 traversals split as follows:
the first traversal is the cold traversal (when no data is
cached in memory), the next 34 are warm traversals (as
more and more data is cached in memory) and finally
the last 10 are hot traversals (when all data is cached in
memory).¹⁷ We use a random number generator to
ensure that each warm traversal selects a new “root” part as
the initial starting point, thus visiting a mostly-different
set of parts in each traversal.
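
The figure of 3280 visits follows from the fan-out of three: 1 + 3 + 3^2 + ... + 3^7 = (3^8 - 1)/2 = 3280. The sketch below shows one way to express the traversal; the names Links, visit, and traverse are our own, and the connection table is assumed to be a vector of three-element arrays of part numbers.

```cpp
#include <array>
#include <cassert>
#include <vector>

using Links = std::vector<std::array<int, 3>>;

// Hypothetical stand-in for the benchmark's empty per-part procedure.
inline void visit(int /*part*/) {}

// Depth-first traversal from `part`, following all three connections
// `levels` hops deep. Returns the number of visits, duplicates included.
inline long traverse(const Links &links, int part, int levels) {
    assert(levels >= 0);
    visit(part);
    long visited = 1;
    if (levels == 0) return visited;
    for (int next : links[part])
        visited += traverse(links, next, levels - 1);
    return visited;
}
```

Calling traverse(links, root, 7) on any table with three links per part returns 3280, regardless of how many distinct parts are actually reached.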


6.3 Experimental Results 


We present results for the OO1 traversal operations cor- 
responding to different address translation granularities 
for the data structures used during the traversals. In 
particular, we are interested in three different address 
translation granularities, namely coarse-grained, mixed- 
granularity and fine-grained strategies. The following 
table describes the types of pointers used for each gran- 
ularity and the corresponding key in the results. 












We use CPU time¹⁸ instead of absolute real time
because the difference in performance is primarily due to
differences in faulting and swizzling, and allocating
address space for reserved pages. Unfortunately,
CPU-time timers on most operating systems have a coarse
granularity (typically several milliseconds), and on a
fast machine it would be impossible to measure any
reasonable differences in the performance due to a change
in the address translation granularity because our
overheads are very small. Thus, we use an older
SPARCstation ELC, which is slow enough to offset the
coarse granularity of the timers, while providing
reasonable results.
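
As an illustration of the metric (user time plus system time), the following sketch reads both components with the POSIX getrusage() call. This is an assumption about how such a measurement could be taken on a POSIX system, not a description of the harness used for the experiments; cpu_seconds is a hypothetical helper name.

```cpp
#include <cassert>
#include <sys/resource.h>
#include <sys/time.h>

// CPU time of the calling process in seconds: user time plus system
// time, as reported by POSIX getrusage().
inline double cpu_seconds() {
    struct rusage usage;
    const int rc = getrusage(RUSAGE_SELF, &usage);
    assert(rc == 0);   // fails only for invalid arguments
    (void)rc;
    const double user = usage.ru_utime.tv_sec + usage.ru_utime.tv_usec / 1e6;
    const double sys  = usage.ru_stime.tv_sec + usage.ru_stime.tv_usec / 1e6;
    return user + sys;
}
```

A traversal would be timed by calling cpu_seconds() before and after it and taking the difference; the resolution of the result is still bounded by the granularity of the operating system's timers.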


¹⁷ This is different from the standard benchmark specification
containing only 20 traversals (split as 1 cold, 9 warm, and 10 hot
traversals); we run more warm traversals because we believe that 9 traversals
are not sufficient to provide meaningful results, especially for the large
database case.

¹⁸ We refer to the sum of user and system time as the CPU time.


Figure 3 presents the CPU time for all traversals in an
entire traversal set run on a large database. As expected,
the cost for “all-raw” case (coarse-grained address
translation) is the highest for the first 15 or so traversals.
This is not unusual because the coarse-grained address
translation scheme swizzles all pointers in the
faulted-on pages and reserves many pages that may never be
used by the application. This is exacerbated by the
poor locality of reference in the benchmark traversals
as many pages of the database are accessed during the
initial traversals, causing a large number of pages to be
reserved. The number of new pages swizzled decreases
as the cache warms up, and we see the corresponding
reduction in the CPU time.


Figure 3: CPU time for traversal on large database


Note that the cost for the “all-smart” case (fine-grained 
address translation) is the lowest for the first 15 traver- 
sals. Again, this is expected because the address trans- 
lation scheme does not swizzle any pointers in a page 
when it is faulted in because they are all smart pointers 
that must be translated at every use. Finally, the CPU 
time for the “smart-index” case (mixed-granularity ad- 
dress translation) falls between the other two cases for 
the first 15 traversals. This is also reasonable because 
only the index structure contains smart pointers, and 
each traversal uses this index only once (to select the root 
part for the traversal). This cost is only slightly less than 
the “all-raw” case because our B+ tree implementation 
generated a tree that was only three levels deep, reducing 
the number of smart pointers that had to be translated for 
each traversal. 


Now consider the hot traversals (36 through 45). The
first thing to note is that the CPU time for the “all-smart”
case is higher than that for the other two cases. This is
because smart pointers impose a continual overhead for
each pointer dereference, and this cost is incurred even
if the target object is resident. In contrast, the “all-raw”
case has zero overhead for hot traversals.¹⁹


Figure 4 shows the corresponding results for the small
database, where only the first 3 or 4 traversals contain
faulting and swizzling.²⁰ Once again, a phenomenon
similar to the one in large database results can be seen in
the current results, but only for the initial traversals. In
particular, the CPU time is highest for the “all-raw” case
and lowest for the “all-smart” case. Also as before, the
two granularities swap their positions for the hot
traversals; the “all-smart” case is more expensive because of
the continual translation overhead at every use. Finally,
as expected, the “all-raw” and “smart-index” results are
identical for hot traversals because no index pointers are
dereferenced.


¹⁹ The “smart-index” results should be identical to the “all-raw”
results for hot traversals because there is no index lookup, and no smart
pointers need to be translated. We attribute the difference between the
hot results in these two cases to caching effects.

²⁰ Most of the database is memory-resident within the first few
traversals because of the extremely poor locality characteristics in the
connections.


Figure 4: CPU time for traversal on small database


6.4 Summary


The results presented above support our assertion that
fine-grained address translation can be effectively used
for data structures with high fanout that are less
conducive for a coarse-grained scheme. At the same time,
a pure fine-grained approach is not the best performing
as the primary address translation mechanism in Texas
because of various overheads associated with it.


One problem with using the OO1 benchmark is that the
operations do not perform any real computation (unlike
in actual applications) with objects that are traversed.
As such, the cost of fine-grained translation is
highlighted as a larger component of the total cost than it
typically would be in an actual application that performs
real “work” on data objects as they are traversed.


7 Conclusions 


We presented a discussion on address translation strate- 
gies, both in the context of the Texas persistent store 
and for general persistence implementations. We also 
proposed a new classification for persistence in terms of 
granularity choices for fundamental design issues rather 
than using taxonomies based only on address translation 
semantics, and discussed each choice that we made in 
Texas. 


We also discussed issues related to fine-grained address 
translation, including their inherent costs that make them 
unsuitable as the primary address translation mechanism 
in a persistence implementation. Instead, we discussed 
how a mixed-granularity approach can be used to selec- 
tively incorporate fine-grained address translation in the 
application. 


We presented our implementation of mixed-granularity 
address translation in Texas which combines the C++ 
smart pointer idiom for the fine-grained translation com- 
ponent with the normal pointer swizzling at page fault 
time mechanism for the coarse-grained translation
component, while maintaining portability and compatibility
of the system.


Our basic performance results using the OO1 benchmark 
have shown that the mixed-granularity approach works 
well for applications with data structures that do not pro- 
vide the best performance with a pure coarse-grained 
approach. However, further performance measurements 
are necessary, especially using real applications instead 
of synthetic benchmarks which do not always model re- 
ality very well. 
Abstract


JMAS is a prototype network computing infrastructure
based on mobile actors [10] using Java technology.
JMAS requires a programming style different from
commonly used approaches to distributed computing.
JMAS allows a programmer to create mobile actors,
initialize their behaviors, and send them messages
using constructs provided by the JMAS Mobile Actor
API. Applications are decomposed by the programmer
into small, self-contained sub-computations and
distributed among a virtual network of Distributed
Run-Time Managers (D-RTMs), which execute and manage
all mobile computations. This system is well suited
for coarse-grain computations on network computing
clusters. Performance evaluation is done using two
benchmarks: a Mersenne Prime Application and the
Traveling Salesman Problem.

Keywords: Distributed systems, parallel computing, 
actor model, mobile agents, actors, network comput- 
ing 


1 Introduction 


Multicomputers represent the most promising de- 
velopments in computer architecture due to their eco- 
nomic cost and scalability. With the creation of faster 
digital high bandwidth integrated networks, hetero- 
geneous multicomputers are becoming an appealing 
vehicle for parallel computing, redefining the concept 
of supercomputing. As these high bandwidth con- 
nections become available, they shrink distances and 
change our models of computation, storage, and in- 
teraction. With the exponential growth of the World 
Wide Web (WWW), the web can be used to exploit 
global resources, such as CPU cycles, making them 
available to every user on the Internet [7, 30]. The 
combined resources of millions of computers on the
Internet can be harnessed to form a powerful global 
computing infrastructure consisting of workstations, 
PCs, and supercomputers (Figure 1). 

Figure 1: Global Computing Infrastructure.


The vision of integrating network computers into 
a global computing resource is as old as the 
Internet [7, 23]. Such a system should hide the
underlying physical infrastructure from users and from
programmers, provide a secure environment for re- 
source owners and users, support access and location 
of large integrated objects, be fault tolerant, and scale 
to millions of autonomous hosts. Some recent network 
computing approaches include CONDOR [29], MPI 
[24], PVM [31], Piranha [22], NEXUS [28], Network of 
Workstations (NOW) [3], Legion [23], and GLOBUS 
[20]. These network computing frameworks use low- 
level communication systems, or high-level dedicated 
systems. Although these systems offer heterogeneous 
collaboration of multiple systems in parallel, they in- 
volve rather complex maintenance of different binary 
codes, multiple execution environments, and complex 
underlying architectures. 

Distributed computing over networks has emerged
as a technology with tremendous promise and
potential, owing in part to the emergence of the
Java Programming Language and the World Wide
Web. Recently, researchers have proposed several
approaches to provide a platform-independent,
Java-based, high-performance network computing
infrastructure. These include Javalin [16], WebFlow [4],
IceT [14], JavaDC [15], Parallel Java [26], Parallel Java 
Agents [27], ATLAS [5], Charlotte [6], ParaWeb [9], 
Popcorn [12], and Ninflet [32]. The use of Java as a 
means for building distributed systems that execute 
throughout the Internet has also been recently pro- 
posed by Chandy et al. [13], Fox et al. [21] and imple- 
mented in [33]. Java, because of its platform indepen- 
dence, overcomes the complexity issues of maintain- 
ing different binary codes, multiple execution environ- 
ments, and complex underlying architectures. It offers 
the basic infrastructure needed to integrate computers 
connected to the Internet into a distributed compu- 
tational resource for running parallel applications on 
numerous anonymous machines. 

Mobile agents are a convenient paradigm for dis- 
tributed computing [8, 18]. The agent specifies when 
and where to migrate, and the system handles the 
transmission. This makes mobile agents easier to use 
than low-level facilities in which the programmer must 
explicitly handle communication, but more flexible 
and powerful than schemes such as process migration 
in which the system decides when to move a program 
based on a small set of fixed criteria. Mobile agents al- 
low a distributed application to be written as a single 
program. 

In this paper we discuss the design and
implementation of a prototype network computing system
(JMAS) based on the mobile actor model [10] using
Java technology [25]. The mobile actor model is a
parallel programming paradigm for distributed parallel
computing based on mobile agents and the actor
message passing model [1]. Applications are decomposed
by the programmer into small, self-contained
subcomputations and distributed among a virtual network of
Distributed Run-Time Managers (D-RTMs), which
execute and manage all mobile computations. Lastly, we
evaluate the performance of our system, and show that
our system is well suited for coarse-grain computations
in a network computing environment. Our
experiments were run using two benchmarks: a Mersenne
Prime Application and the Traveling Salesman
Problem.


2 The Actor Model 


Actors are self-contained, interactive, autonomous 
components of a computing system that communicate 
by asynchronous message passing. Actors are
characterized by an identity (i.e., a mail address), a mailbox,
and a current behavior. Moreover, a mail address may
be included in messages sent to other actors; this
allows those actors to communicate with the actor whose
mail address they have received. The ability to com- 
municate mail addresses of actors implies that the in- 
terconnection network topology of actors is dynamic. 
This dynamic interconnection network topology im- 
plies that the underlying resources can be represented 
as actors to build a system architecture. Each time 
an actor processes a communication, it also computes 
its behavior in response to the next communication 
it may process. Acquaintances represent actors whose 
mail addresses are known to the actor. Because all ac- 
tor communication is asynchronous, all messages are 
buffered in mail queues until the actor is ready to re- 
spond to them. Messages sent are guaranteed to be 
received with an unbounded but finite delay. 

Each actor may be thought of as having two aspects
that characterize its behavior:


1. its acquaintances, which are the finite collection
   of actors that it directly knows about;


2. the action it should take when it is sent a
   message. These actions provide a primitive set of
   operations to:

   • send messages asynchronously to specified actors,

   • create actors with specified behaviors, and

   • become a new actor, assuming a new behavior
     to respond to the next message.
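The three primitives listed above can be illustrated with a minimal, single-machine sketch. It is written in current Java syntax rather than JDK1.1, and the names Behavior, SketchActor and CounterDemo are hypothetical; they are not part of the JMAS Mobile Actor API. The sketch assumes that "become" can be modeled by having each behavior return the behavior to use for the next message.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical illustration only; these names are not part of the JMAS API.
interface Behavior {
    // Processes one message and returns the behavior for the next message ("become").
    Behavior receive(SketchActor self, Object message);
}

class SketchActor {
    private final Queue<Object> mailbox = new ArrayDeque<>(); // buffered mail queue
    private Behavior behavior;                                // current behavior

    SketchActor(Behavior initial) {                           // "create"
        this.behavior = initial;
    }

    // "send": the message is only enqueued here; it is processed later by step().
    synchronized void send(Object message) {
        mailbox.add(message);
    }

    // Processes one buffered message, if any; returns false when the mailbox is empty.
    boolean step() {
        Object message;
        synchronized (this) {
            message = mailbox.poll();
        }
        if (message == null) {
            return false;
        }
        behavior = behavior.receive(this, message);           // replacement behavior
        return true;
    }
}

class CounterDemo {
    // A behavior that counts messages by becoming a new behavior after each one.
    static Behavior counting(int seen, List<Integer> log) {
        return (self, message) -> {
            log.add(seen + 1);
            return counting(seen + 1, log);
        };
    }

    // Sends two messages to a fresh actor and returns the recorded counts.
    static List<Integer> run() {
        List<Integer> log = new ArrayList<>();
        SketchActor actor = new SketchActor(counting(0, log));
        actor.send("a");
        actor.send("b");
        while (actor.step()) {
            // drain the mailbox
        }
        return log;
    }
}
```

The counter keeps no mutable field of its own: its state lives entirely in the replacement behavior, which is the point of the become primitive.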


The actor primitive operators (i.e. send, create, and
become) form a simple but powerful set on which to
build a wide range of higher-level abstractions and
concurrent programming paradigms. Although there
is sufficient research supporting the use of the actor
model for fine- and large-grain applications on tightly
coupled systems, there has been no actor-based solution
for large-scale, data-intensive distributed applications
whose components may be interconnected by costly
communication links. To support this environment,
locality of reference and resource management
(i.e. load balancing) must be addressed, because
processes must be able to migrate throughout the system.
In the next section, we address locality of reference
and resource management through actor mobility. We
present a communication paradigm among mobile agents
that incorporates actor-based message passing to
support dynamic architecture topologies for
distributed parallel computing.


3 The Mobile Actor Paradigm


A mobile actor is an actor with the semantics of
mobility and navigational autonomy. Navigational
autonomy is the degree to which a message can be
viewed as an object with its own innate behavior,
capable of making decisions about its own destiny. The
actor model inherently enforces navigational autonomy
by allowing addresses of actors to be communicated,
thus providing a dynamic interconnection network
topology. Such a computing model provides support
for non-deterministic problems that require network
reconfiguration, non-deterministic communication, and
dynamic process coordination. In many practical
distributed applications, the overconsumption of local
resources does not allow computations to be processed
efficiently. A more feasible solution is to migrate the
process to the least-utilized resources, or to move the
process to a data server or communication partner so
that it can reach that server or partner through local
communication, which reduces network load. The mobile
actor model is a strategy for remote execution and
process migration using the actor message-passing
paradigm (e.g. for load balancing and locality of
reference of data/behaviors). A remote execution
includes the transport of a process to a remote location
and the start of its execution there. Process
migration includes the transport of process code,
execution state, and data of the process; processes may
be restarted from their previous state. The execution
of computations may migrate across file systems
consisting of networks of computers and/or computing
clusters.

We extend the actor primitive operations performed in
response to a message with semantics that support actor
mobility. The semantics of actor mobility are enforced
upon receipt of a message, or when dynamically
creating another actor on a remote location. These
extended primitive operations allow computations to
migrate after a state change.

The behavior of mobile actors consists of two kinds of
actions in response to a message:


1. becomeremote: Computes a replacement behavior
   on the local machine and migrates to a location
   on a remote machine. The migrated actor is
   characterized by the identity (i.e. its mail address)
   and mailbox of a specified location of an actor on
   a remote machine.


2. createremote: Creates a new actor on the local
   machine and migrates it to the remote location,
   assuming a new behavior to respond to the next
   message.


4 JMAS: A Java-Based Mobile Actor
System


Exploiting the resources of several interconnected 
computers to form a powerful network computing in- 
frastructure is the goal of this research. Such an infras- 
tructure should provide a single interface to users that 
provides large amounts of computing power, while hid- 
ing from users the fact that the system is composed 
of hundreds to thousands of machines scattered across 
the country. Our vision is to create a system in which 
a user sits at a workstation, and has the illusion of a 
single very powerful computer. In this section, we dis- 
cuss the technical issues associated with the construc- 
tion of a network computing infrastructure which exe- 
cutes mobile actor computations. A mobile actor sys- 
tem is a multi-user, heterogeneous, network comput- 
ing environment for executing distributed actor-based 
computations. A mobile actor system must support 
two basic tasks - the creation and migration of re- 
mote actors, and the communication between actors 
distributed throughout the system. In addition the 
system should: 


• provide language support for the mobile actor
  programming model,

• provide a single consistent namespace for actors
  within the system,

• provide an efficient execution schedule between
  actors maintained on the local machine,

• be able to distribute the load evenly among the
  machines participating in the distributed system,

• be fault tolerant, and

• be secure.


4.1 JMAS Infrastructure


JMAS is a network computing environment for
executing mobile actor computations. JMAS is designed
using Java technology [25], and requires a programming
style different from commonly used approaches
to distributed computing. JMAS allows a programmer
to create mobile actors, initialize their behaviors,
and send them messages using constructs provided by
the JMAS Mobile Actor API. As the computation
unfolds, mobile actors can implicitly navigate
autonomously throughout the underlying network. New
messages are generated, new actors are created, and
existing actors undergo state changes. JMAS
also makes mobile actor locality visible to
programmers to give them explicit control over actor
placement. However, programmers still do not need to
keep track of an actor’s location in order to send it a
message. Data flow and control flow of a program
in JMAS are concurrent and implicit. A programmer
thinks in terms of what an actor does, not about how
to thread the execution of different actors.
Communication between mobile actors is point-to-point,
non-blocking, asynchronous, and thus buffered.


4.2 Language Support in JMAS 


JMAS is based on the Java Programming Language
and Virtual Machine of JDK1.1 [25]. JDK1.1 contains
mechanisms that allow objects to be read from and
written to streams (object serialization), as well as an
API that provides constructs to dynamically build
objects at run-time (i.e. the Reflection package,
java.lang.reflect). We exploit heterogeneity through
Java’s platform-independent (i.e. write once, run
anywhere) framework. We provide a Mobile Actor API for
developing mobile actor applications using the Java
Programming Language. Mobile actor programs are
compiled using a Java compiler that generates Java
bytecode. Java bytecode can be executed on any machine
containing a Java Virtual Machine. Actors in JMAS are
implemented as lightweight processes called threads.
The API provides constructs that allow programmers to
create mobile actors using static or dynamic placement,
to change an actor’s state, and to send communications
to an actor.
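As a rough illustration of the object serialization mechanism mentioned above, the following sketch writes an object to a byte stream and reads it back using java.io.ObjectOutputStream and java.io.ObjectInputStream. The classes PrimeSearchState and MigrationSketch are hypothetical and only stand in for whatever state JMAS actually transports; the byte array stands in for a socket stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical actor state; the real JMAS classes are not shown in the text.
class PrimeSearchState implements Serializable {
    private static final long serialVersionUID = 1L;
    final String actorId;
    final int nextCandidate;

    PrimeSearchState(String actorId, int nextCandidate) {
        this.actorId = actorId;
        this.nextCandidate = nextCandidate;
    }
}

class MigrationSketch {
    // Writes the state to a byte array, as it might be written to a socket stream.
    static byte[] pack(Serializable state) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(state);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Rebuilds the state on the receiving side.
    static Object unpack(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("class of serialized object not available", e);
        }
    }
}
```

Note that standard serialization captures object fields only; how JMAS captures the execution state of a running thread is not described here and is not modeled by this sketch.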


4.3 Consistent Mobile Actor Names in 
JMAS 


JMAS implements a simple location-dependent
naming strategy tightly coupled with mobile actors
within the system. Each mobile actor within the
system is given a globally unique identifier. This
identifier is bound to exactly one address at a time by
the underlying message system. These bindings may
change over time, for example when a mobile actor
migrates to a different machine. In such a case,
messages are forwarded to the new location by the
underlying message system. It has been shown in [11]
that forwarding messages in a distributed system
consisting of N machines requires N - 1 message rounds
in the worst case.


4.4 Scheduling and Load Balancing in 
JMAS 


The JVM implements a time-sliced schedule of
threads on Windows 95 systems, and a pre-emptive,
priority-based schedule on UNIX/Windows NT
systems. JMAS forces a pre-emptive, priority-based
schedule among local threads, regardless of the
underlying architecture. The efficiency of an actor-based
computation on a loosely coupled architecture
depends on where different actors are placed and on the
communication traffic between them. Thus, the
placement and migration of actors can drastically affect
the overall performance. We implement a decentralized,
fault-tolerant load balancing scheme based on the
CPU market strategy proposed in [12]. The market
strategy is based on CPU time. Entities within the
system consist of buyers and sellers. A seller allows
its CPU to be used by other programs. A buyer
is a machine that wants to off-load work to a seller. A
meeting place in which buyers and sellers are
brought together is known as a market. Computations
are distributed to sellers using a round-robin schedule.
This strategy is intended for coarse-grain applications.


4.5 Security in JMAS 


Security issues are not addressed in this version of
the prototype system. Policies could be enforced to
encrypt and decrypt all Java class files and messages
sent throughout the system. Any such policy would,
however, reduce the overall performance of the system.


4.6 Fault Tolerance in JMAS 


Machines used within the JMAS infrastructure are
fault tolerant to the extent possible without
compromising overall system performance. We limit
our concern to fail-stop faults of hardware
components and of the network. The underlying
communication system guarantees the delivery of messages
through the use of reliable, connection-oriented
TCP sockets. Further, if a host fails, JMAS
removes that host from the current CPU Market
configuration.


5 JMAS Architecture 


The architecture of JMAS is organized as a series
of layers or levels, each one built upon its
predecessor (Figure 2). The lowest level (physical layer)
is the actual physical network, which may consist of a
LAN/WAN of PCs and/or workstations. It could also
represent a global network such as the Internet. The
second layer (daemon layer) consists of the collection
of daemons residing on all physical machines
participating in the distributed system. Each daemon
listens on a reserved communication port, receiving
communications that may consist of messages or
migrating computations. Upon receipt, a communication
is passed to the third layer. The third layer consists
of Distributed Run-Time Managers (D-RTMs). The
D-RTM is responsible for message handling from/to
local/remote processes, and for scheduling and load
balancing of processes. The fourth layer (logical layer)
consists of the actual application-specific computations
on the local machine. Computations are expressed as
mobile actors. Each actor encapsulates a behavior,
an identity, a mail queue, and one thread. The logical
layer shows each actor and its acquaintances (e.g. A
knows about B and C, etc.).


Figure 2. Four Layer Mobile Actor Architecture.


In the following sections, we give a detailed descrip- 
tion of the JMAS architecture. In particular, we dis- 
cuss the components of each layer, and show how Java 
technology is applied. 


5.1 Physical Layer 


The physical layer is the actual physical network, 
which may consist of a LAN/WAN of PCs and/or 
workstations. These systems are referred to as scalable 
computer clusters (SCCs), or networks of worksta- 
tions (NOWs) [3]. Both systems are developed within 
a trusted environment. Therefore security issues are 
not a major concern. The disadvantage is that the 
scalability of these systems is limited to the resources 
available to the system administrator. The physical 
layer could also represent interconnected networks of 
computer clusters. 


5.2 Daemon Layer


The daemon layer is implemented as a collection
of daemon threads residing on all physical nodes
participating in the JMAS distributed environment. The
responsibility of the daemon thread is to continuously
monitor the network, receiving local/remote
communication messages and mobile computations arriving
from other machines. JMAS supports a message-driven
model of execution (Figure 3).


Figure 3. Message-driven model of execution.


There is no local/remote peer-to-peer communication 
between mobile actors within the system. All commu- 
nication is routed through a reserved port of a daemon 
thread residing on the local machine. The reserved 
port for JMAS is 9000. Message reception by the dae- 
mon thread creates a thread within the actor which 
executes the specified method with the message as its 
argument. Only message reception can initiate thread 
execution. Furthermore, thread execution is atomic. 
Once successfully launched, a thread executes to com- 
pletion without blocking. 


5.3 Distributed Run-Time Manager 


The Distributed Run-Time Manager (D-RTM) is 
the most complex of the four layers. It is contained 
within each daemon in the system. Therefore, the 
daemon layer and D-RTM layer are tightly coupled. 
The D-RTM contains the basic underlying software 
that provides the transparent interface to the network 
computing system. The D-RTM was designed using a 
layered virtual machine design built on top of the Java 
Virtual Machine (JVM) using JDK1.1 [25] (Figure 4). 
The D-RTM has several functions: 


• To handle all incoming Tasks (i.e. Message
  Handler),

• To prepare actor processes to run on the local
  system (i.e. Actor Context),

• To load Java bytecode (e.g. Java objects) from
  local/remote locations (i.e. BehvLoader),

• To schedule local/remote threads using a
  pre-emptive, priority schedule (i.e. Scheduler),

• To manage the CPU load on the local machine
  (i.e. Load Balancer).


Figure 4. Distributed Run-Time Manager (D-RTM).


5.3.1 Message Handler 


The message handler is responsible for routing Tasks
that consist of communications to local actors. As
illustrated in Figure 5, messages are stored in a table
of message queues (i.e. mailboxes). A mailbox may
have one or more actors within the local actor
context associated with it. We implement the table of
mailboxes as a hash table, using Java’s Hashtable class
provided by the java.util package. Because Java
implements its Hashtable as a synchronized object, each
access to the Hashtable is atomic, which is very
useful in our multi-threaded environment. Each mail
address hashes to one mailbox in the table. In
order to achieve maximum parallelism, the table is
accessed by subprocesses. Messages from a desired
mailbox are forwarded asynchronously to the actor
processes whose identity is denoted by the mail address
of the mailbox.


Figure 5. Message Handler.
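A table of mailboxes of the kind described above might look as follows. This is a sketch under assumptions: only the use of java.util.Hashtable comes from the text, while the class MailboxTable, its method names, and the choice of a list as the queue are hypothetical. It uses current Java syntax (generics and lambdas) rather than JDK1.1.

```java
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;

// Hypothetical names; only the use of java.util.Hashtable comes from the text.
class MailboxTable {
    // mail address -> queue of pending messages (the mailbox)
    private final Hashtable<String, List<Object>> mailboxes = new Hashtable<>();

    // Stores a message in the mailbox for the given mail address.
    void deliver(String mailAddress, Object message) {
        // Hashtable methods are synchronized, so creating the mailbox is atomic.
        List<Object> box = mailboxes.computeIfAbsent(mailAddress, k -> new ArrayList<>());
        synchronized (box) {   // the list itself needs its own lock
            box.add(message);
        }
    }

    // Removes and returns all pending messages for the address (empty if none).
    List<Object> drain(String mailAddress) {
        List<Object> box = mailboxes.get(mailAddress);
        if (box == null) {
            return new ArrayList<>();
        }
        synchronized (box) {
            List<Object> pending = new ArrayList<>(box);
            box.clear();
            return pending;
        }
    }
}
```

The synchronized Hashtable protects the table, but not the individual queues stored in it, which is why the sketch locks each mailbox separately.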


5.3.2 Actor Context 


The Actor Context is responsible for instantiating an
object, wrapping the object within a thread, and
supplying the thread to the Scheduler. It also maintains
a table of system information, such as:


• The actor identity

• The current behavior

• The current method (communication being executed)

• The total (idle) time the actor waited in the ready
  queue before receiving a communication (msec)

• The total time to load the actor (msec)

• The current running time (msec)


Objects in JMAS are built during runtime. In- 
formation about an object during runtime is ob- 
tained using Java Reflection [25]. The classes needed 
to perform these operations are obtained from the 
java.lang.reflect package of the JDK1.1. 
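The following sketch shows one way an object can be built at run-time from its class name and wrapped in a thread, using Class.forName and java.lang.reflect.Constructor. It assumes that a behavior can be modeled as a Runnable with a no-argument constructor; EchoBehavior and ActorContextSketch are hypothetical names, not JMAS classes.

```java
import java.lang.reflect.Constructor;

// Hypothetical behavior class standing in for application code.
class EchoBehavior implements Runnable {
    static volatile String lastRun = "";

    @Override
    public void run() {
        lastRun = "EchoBehavior ran";
    }
}

class ActorContextSketch {
    // Builds an object from its class name at run-time and wraps it in a thread.
    static Thread instantiate(String className) {
        try {
            Class<?> type = Class.forName(className);
            Constructor<?> ctor = type.getDeclaredConstructor();
            Object behavior = ctor.newInstance();
            // Assumption of this sketch: every behavior implements Runnable.
            return new Thread((Runnable) behavior, className);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate " + className, e);
        }
    }
}
```

A caller would pass a class name, for example `ActorContextSketch.instantiate(EchoBehavior.class.getName())`, and hand the returned thread to a scheduler instead of starting it directly.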


5.3.3 Scheduler 


JMAS implements a pre-emptive, priority-based
scheduler among local threads. Each thread is
assigned a priority that can only be changed by the
programmer. The thread that has the highest
priority is the currently running thread. Processes with
a lower priority are interrupted. To ensure that
starvation does not occur among threads, we implement
a round-robin schedule among local processes. As
illustrated in Figure 6(a), incoming threads and threads
instantiated locally are given an initially low priority.
Threads are then placed into a queue data structure.
The scheduler dequeues a thread from the list and
assigns it the highest possible priority, causing this
thread to run. After a given time t, the thread is
stopped and inserted back into the list. This process
continues until all threads within the list terminate
(Figure 6(b)). The scheduler can be interrupted by
the load balancer if the CPU reaches its computation
threshold. This causes the currently running
thread to suspend and migrate to a remote machine
to continue its execution. Computations are migrated
to remote locations using a round-robin schedule.


Figure 6. Thread Scheduler.
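The round-robin policy described above can be simulated without real threads. In the sketch below, work is counted in abstract time units and each job "runs" for at most one time slice before being re-queued; the class RoundRobinSketch and its Job type are hypothetical and do not reproduce the JMAS scheduler, which manipulates thread priorities.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Simulation only: work is counted in abstract time units, no real threads run.
class RoundRobinSketch {
    static final class Job {
        final String name;
        int remaining; // time units of work left

        Job(String name, int remaining) {
            this.name = name;
            this.remaining = remaining;
        }
    }

    // Runs jobs in round-robin order with the given time slice and
    // returns the names in the order in which the jobs finish.
    static List<String> run(List<Job> jobs, int slice) {
        if (slice <= 0) {
            throw new IllegalArgumentException("slice must be positive");
        }
        Queue<Job> ready = new ArrayDeque<>(jobs);
        List<String> finished = new ArrayList<>();
        while (!ready.isEmpty()) {
            Job current = ready.remove();          // dequeue; this job now "runs"
            current.remaining -= Math.min(slice, current.remaining);
            if (current.remaining == 0) {
                finished.add(current.name);        // terminated
            } else {
                ready.add(current);                // stopped and re-queued
            }
        }
        return finished;
    }
}
```

With jobs A, B and C needing 3, 1 and 2 units and a slice of 1, the jobs finish in the order B, C, A: every job makes progress in each round, so none is starved.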


5.3.4 ClassLoader


In order to load classes from remote locations, we im- 
plemented our own classloader. The BehvLoader al- 
lows classes to be loaded over the network and stored 
within the local cache. The BehvLoader loads classes 
to the interpreter using the following sequence of op- 
erations (Figure 7). 


1. Check if the class already exists in the local cache.
   If not,

2. Check if the class is a system class. If not,

3. Check the local disk. If not found,

4. Check the remote disk where the request
   originated. If not found,

5. A NoSuchClassFound exception is thrown.


Figure 7. Operation of JMAS ClassLoader.
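The lookup order above can be sketched as a small simulation. Byte arrays in maps stand in for class files on the local and remote disks, no class is actually defined, and the test for a system class (a name starting with "java.") is a simplifying assumption. LookupOrderSketch and its members are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Simulation of the lookup order only; byte arrays stand in for class files and
// no real class is defined. All names here are hypothetical.
class LookupOrderSketch {
    final Map<String, byte[]> cache = new HashMap<>();
    final Map<String, byte[]> localDisk = new HashMap<>();
    final Map<String, byte[]> remoteDisk = new HashMap<>();

    // Returns where the class was found; throws if every step fails.
    String locate(String className) {
        if (cache.containsKey(className)) {
            return "cache";                                   // step 1
        }
        if (className.startsWith("java.")) {
            return "system";                                  // step 2 (simplified test)
        }
        byte[] bytes = localDisk.get(className);              // step 3
        String source = "local disk";
        if (bytes == null) {
            bytes = remoteDisk.get(className);                // step 4
            source = "remote disk";
        }
        if (bytes == null) {
            throw new IllegalStateException("class not found: " + className); // step 5
        }
        cache.put(className, bytes);                          // keep for later requests
        return source;
    }
}
```

A real loader would typically subclass java.lang.ClassLoader and turn the fetched bytes into a class with defineClass; that part is omitted here because the details of BehvLoader are not given.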


Different features can be added to the BehvLoader to
provide security, such as:


• encryption/decryption of class files

• use of signatures


5.3.5 Load Balancer 


We implement a load balancing scheme based on the
CPU market strategy proposed in [12]. The market
strategy is based on CPU time. Entities within the
system consist of buyers and sellers. A seller allows
its CPU to be used by other programs. A buyer
is a machine that wants to off-load work to a seller. A
meeting place in which buyers and sellers are
brought together is known as a market. CPUs are chosen
from the market using one of three selection policies:


1. Optimal (Best) selection, 
2. Round-Robin selection, or 


3. Random selection. 
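The three selection policies can be sketched as follows. The text does not define "optimal", so the sketch assumes it means the seller with the lowest load; the classes SellerSelection and Seller are hypothetical, and the seller list is assumed to be non-empty.

```java
import java.util.List;
import java.util.Random;

// Hypothetical sketch: a seller is described only by a host name and a load value.
class SellerSelection {
    static final class Seller {
        final String host;
        final int load;

        Seller(String host, int load) {
            this.host = host;
            this.load = load;
        }
    }

    private int next = 0; // position for round-robin selection

    // Optimal (best): assumed here to mean the seller with the lowest load.
    Seller best(List<Seller> sellers) {
        Seller choice = sellers.get(0);
        for (Seller s : sellers) {
            if (s.load < choice.load) {
                choice = s;
            }
        }
        return choice;
    }

    // Round-robin: cycle through the sellers in list order.
    Seller roundRobin(List<Seller> sellers) {
        Seller choice = sellers.get(next % sellers.size());
        next = (next + 1) % sellers.size();
        return choice;
    }

    // Random: pick any seller with equal probability.
    Seller random(List<Seller> sellers, Random rng) {
        return sellers.get(rng.nextInt(sellers.size()));
    }
}
```

Round-robin needs only a counter, whereas the optimal policy depends on load information that may be stale under weakly consistent replication.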


5.3.5.1 Developing a Market of CPUs


We implement a decentralized hierarchical method
for organizing the CPU market. Each machine within
the system is responsible for managing a market.
Therefore, the process of managing a market is
distributed throughout the system, which increases
market reliability and availability. When starting the
system, the D-RTM initializes its market by registering
itself with machines designated within a configuration
file set by the system administrator. Those machines
willing to sell their CPU respond with a message
SELLER, and are added to the market as sellers.
Machines that wish to buy CPU time respond with a
message BUYER, and are added to the market as buyers.
Those that do not respond (e.g. because the system is
down) are not added to the market. This market,
maintained by the D-RTM, contains the secondary
machines on which to off-load remote processes. As
shown in Figure 8, this creates a logical hierarchy of
machines. Each node within the hierarchy, with the
exception of the bottom-most nodes, is denoted as a
market manager. Communication overhead is minimal.
CPUs wishing to sell their time add themselves to the
market by notifying a market manager (Figure 8).
Buying from the market is a bottom-up process. Nodes
at the lowest level become overloaded faster. Once a
given node X is denoted as a buyer, all nodes that are
descendants of X are also denoted as buyers. This
approach requires collaboration among system
administrators to organize an optimal hierarchy, and it
is not suitable for a global environment that must scale
to hundreds or thousands of machines.


Figure 8. CPU Market Hierarchy.


We modify the hierarchical method by allowing
market initialization and registration to be
bi-directional. Not only does the D-RTM register itself
with machines designated by the system administrator,
but each of those machines also registers itself with the
D-RTM. In such a situation, the market is organized by
managers who are logically connected in a (complete)
multidirectional topology. Because machines belong to
more than one market, this configuration increases the
communication overhead substantially. Communication
increases from one message round to an expensive
multicast. As shown in Figure 9, not only do machines
B, C, and D notify machine A when buying or
selling their CPU time, but machine A must also notify
machines B, C, and D when buying or selling its CPU
time. Changes in the CPU status (i.e. Buyer/Seller)
are propagated to all machines within a market using a
weakly consistent replication strategy. We use weakly
consistent replication in order to reduce the
communication overhead. Notifications are replicated
throughout the system by piggybacking the CPU status
of the current machine on communication sends.
For example, when an actor on machine B receives
a communication from an actor on machine A, the
CPU market on machine B is updated with the new
CPU status of machine A. Although machines are not
instantly notified of a market change, this weakly
consistent replication strategy provides eventual
delivery of status changes, which is tolerated in our
system [11].


Figure 9. Host A Notifies Markets of B, C, and D.


5.3.5.2 Load Balancing Policy 


Each machine within the distributed system maintains
a data structure with information about the current
machines within its market. These machines are
denoted as buyers or sellers. The load factor on the
machine is relative to the number of threads currently
running on the local machine. Other factors could
also be used to determine the load, such as the
total load on the machine, heuristic information, the
actual CPU utilization, and the size of the
computation. Most of these metrics are more complicated
to determine. As shown in Figure 10, the Load Balancer
keeps the load below 75% of the threshold and at least
25% of the threshold above the minimum load (i.e. zero).


Figure 10. Load Balancing Policy.


Before starting a thread on the local machine, the load
balancer checks the current load to ensure that it is
within the threshold. If the load is not within the
current threshold, the load balancer off-loads a local
process to a machine within its market that wishes to
sell its CPU (Figure 11). If there are no sellers within
the market, the load balancer starts the process
locally, and tries to off-load processes later. Note that
the D-RTM is now a buyer of CPU time and needs to
inform its market managers of its new status. By
default the status of a machine is seller; the status
field is therefore changed to buyer.


Variable Definitions:

t : Task (communication sent throughout the system)
load : Integer to denote the current load on the local machine
Threshold : Integer to denote the load limit on the local machine
BUYER, SELLER : constants to denote the state of the machine
CPUStatus : enumerator to denote the state of the machine
    (BUYER/SELLER)
host : contains the host location of an available CPU
scheduleLocal(t) : schedules the Task t (i.e. an actor) on the
    local machine
scheduleRemote(t, host) : schedules the Task t (i.e. an actor) at
    the location host
getAvailHost() : returns an available CPU (SELLER) from the
    market
updateMarket(t) : updates the CPUStatus of the machine from
    which the Task t originated


LoadBalancer:

1. Receive Task t

2. If t is an actor
       If load + 1 < Threshold, then
           set CPUStatus to SELLER
           scheduleLocal(t)
           increment load by 1
       Else
           set CPUStatus to BUYER
           host = getAvailHost()
           scheduleRemote(t, host)
   Else if t is a communication
       updateMarket(t)
       forward task to Message Handler

3. goto 1


Figure 11. Load Balancing Algorithm.
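The actor branch of the algorithm in Figure 11 can be sketched in Java as shown below. The sketch is a simplification under assumptions: scheduling is only recorded in a log, sellers are chosen round-robin from a fixed list, and the fallback of running locally when no seller exists follows the prose description rather than the figure. LoadBalancerSketch and its members are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch of the decision step; scheduling is only recorded in a log.
class LoadBalancerSketch {
    enum CpuStatus { BUYER, SELLER }

    private final int threshold;
    private final Queue<String> sellers;       // hosts currently selling CPU time
    private int load = 0;                      // number of locally scheduled actors
    CpuStatus status = CpuStatus.SELLER;       // default status is seller
    final List<String> log = new ArrayList<>();

    LoadBalancerSketch(int threshold, List<String> sellerHosts) {
        this.threshold = threshold;
        this.sellers = new ArrayDeque<>(sellerHosts);
    }

    // Handles one actor task: run it locally while under the threshold,
    // otherwise off-load it to the next seller in round-robin order.
    void receiveActor(String task) {
        if (load + 1 < threshold) {
            status = CpuStatus.SELLER;
            scheduleLocal(task);
        } else {
            status = CpuStatus.BUYER;
            String host = sellers.poll();      // plays the role of getAvailHost()
            if (host == null) {
                scheduleLocal(task);           // no seller: start locally
            } else {
                sellers.add(host);             // rotate for round-robin
                log.add(task + " -> " + host); // plays the role of scheduleRemote(t, host)
            }
        }
    }

    private void scheduleLocal(String task) {
        load++;
        log.add(task + " -> local");
    }
}
```

With a threshold of 3 and sellers h1 and h2, the first two actors run locally and the following ones alternate between h1 and h2.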


5.4 Logical Layer


The logical layer consists of the actual application 
specific computations that are executing on the local 
machines. The computation model consists of mobile 
actors which encapsulate: a behavior, an identity, a 
mail queue, and one thread. Each computation runs in 
its own thread, and may communicate with any other 
thread on the local/remote machines. Computations 
are expressed as Java programs using mobile actor se- 
mantics provided by constructs of the JMAS Mobile 
Actor API. The mobile actor API gives programmers 
the ability to create actors, change the state, or send 
communications to mobile actors within the global 
system. The underlying resources can be logically rep- 
resented as mobile actors to build dynamic architec- 
ture topologies. This dynamic architecture gives the 
programmer an illusion of a global computer that can 
run concurrent, distributed, and parallel applications. 
Implementation details of the underlying system are 
transparent to the programmer in the logical layer. 


6 Performance Evaluation 


JMAS offers the basic infrastructure needed to
integrate computers connected by a network into a
distributed computational resource: an infrastructure
for running coarse-grain parallel applications on several
anonymous machines. Cluster computing in
a LAN setting is already used extensively to
run computation-intensive applications [17],[19]. We
conducted our experiments in an environment
consisting of:


• 1 Sun Microsystems Enterprise 3000, configured
  with two UltraSparc processors each running at
  256 MHz.

• 1 Sun Ultra Sparc workstation, configured with
  one 120 MHz processor.

• 14 Sun Sparc 20 workstations, each configured
  with one 200 MHz processor.

• 1 Sun Sparc 10 workstation, configured with one
  166 MHz processor.


The machines are connected by 10 and 100 Mbit
Ethernet. All experiments were conducted under
typical daily workloads. We tested each algorithm in
a controlled environment of D-RTMs that were used
strictly to run our experiments. CPU selection from
the CPU market was performed by the D-RTM using
a round-robin selection policy. Under our controlled
environment, an optimal selection policy achieves the
same results as round-robin CPU selection. We did
not run our experiments using a random CPU
selection policy. This was done to ensure that all
processes mapped to one and only one machine. In order
to obtain a relative performance of our system, we
calculate the average of the execution times over N = 10
experiments, producing an arithmetic mean (AM):


$$AM = \frac{1}{N} \sum_{i=1}^{N} Time_i$$


where $Time_i$ is the execution time for the i-th
experiment. All experiments are compared with
performance metrics obtained from similar computations
on stand-alone workstations.


6.1 Benchmarks 


The overhead of migrating actors to remote
locations and of passing messages between remote actors
is of great interest. We present experimental results
for our prototype using two benchmarks: a Traveling
Salesman application and a Mersenne Prime
application. We discuss their implementation and
performance using the JMAS infrastructure.


6.2 Factors That Limit Speedup 


A number of factors can limit the
speedup achievable by a parallel algorithm executing
in a network computing infrastructure such as JMAS.
An obvious constraint is the size of the input
program. If there is not enough work to be done by the
number of processors available, then a parallel
algorithm will not show an increase in speedup.
Second, the number of process creations must be
minimized. In particular, we are concerned with the
creation of remote actors throughout the distributed
system. Lastly, in a network computing environment
where communication cost is high, the number and
packet size of inter-process communications must be
limited. Table 1 shows the performance of two
micro-benchmarks that measure the execution time for
communication sends and for remote class loading using
the JMAS prototype. A micro-benchmark is a small
experiment used to monitor the performance of
underlying system operations. Results were obtained
using a test packet to send a communication and to load
a Java class file between two machines.


Overhead    secs
Send        .006-.070
Loading     .15-.28

Table 1. Micro benchmarks for a 10 Mbit
Ethernet LAN using TCP sockets.








In general, the total cost of distributing a program for
parallel execution is defined as:


$$T_{cost} = Total_{loadTime} + Total_{commTime} + Total_{execTime}$$


where $Total_{loadTime}$ is the time to load the needed
Java class files to each machine within the system,
$Total_{commTime}$ is the time spent sending
communications between actors, and $Total_{execTime}$ is the
time spent executing the fraction of the computation.
Moreover, the total time to distribute the needed Java
class files across N machines is:


$$Total_{loadTime} = (N - 1) \cdot t_{load}$$


where $t_{load}$ is the average time to load the needed Java
class files to one machine within the system. We
assume that the machines are organized in a
master-slave topology, such that the master is used to
process a subcomputation, as well as to distribute N - 1
subcomputations to, and receive the partial results from,
the other N - 1 slave machines. Assuming we
distribute the load evenly among N machines, the
time to execute a fraction of the computation is:


$$Total_{execTime} = t_{seq} / N$$


where $t_{seq}$ is the total sequential execution time for
the application. Given the load distribution above, if
each subcomputation sends at most k messages, then
the communication overhead $Total_{commTime}$ can be
defined as:


$$Total_{commTime} = (N - 1) \cdot k \cdot t_{send}$$


where $t_{send}$ is the average time to send a
communication between two machines. Given N machines,
we derive a general formula for the total cost of
distributing a program for parallel execution:


$$T_{cost}(N) = (N - 1) \cdot t_{load} + (N - 1) \cdot k \cdot t_{send} + t_{seq} / N$$


Using the equation above, we can estimate the performance
of a given application. As shown below, in
order to benefit from parallelization the following
inequality must hold:


$T_{cost}(N) < t_{seq}$


$(N - 1) * t_{load} + (N - 1) * k * t_{send} + t_{seq} / N < t_{seq}$


Solving the inequality, we find that the total cost (i.e.
$T_{cost}(N)$) is less than the sequential execution time
(i.e. $t_{seq}$) for:


$N < t_{seq} / (t_{load} + k * t_{send})$
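
The cost model and the bound above can be checked numerically. The following is a minimal sketch, not part of JMAS: the function names are ours, and the values of `t_seq` and `k` are made up for illustration (`t_load` and `t_send` are taken from the ranges reported in Table 1).

```python
def total_cost(n, t_seq, t_load, t_send, k):
    """T_cost(N) = (N-1)*t_load + (N-1)*k*t_send + t_seq/N."""
    return (n - 1) * t_load + (n - 1) * k * t_send + t_seq / n


def machine_bound(t_seq, t_load, t_send, k):
    """Upper bound on N from the inequality: N < t_seq / (t_load + k*t_send)."""
    return t_seq / (t_load + k * t_send)


# Hypothetical workload: 100 s sequential run, 10 messages per subcomputation.
T_SEQ, T_LOAD, T_SEND, K = 100.0, 0.20, 0.008, 10

if __name__ == "__main__":
    print("bound on N:", round(machine_bound(T_SEQ, T_LOAD, T_SEND, K), 2))
    for n in (2, 16, 400):
        cost = total_cost(n, T_SEQ, T_LOAD, T_SEND, K)
        # Parallel execution pays off only while cost < T_SEQ.
        print(n, round(cost, 3), cost < T_SEQ)
```

With these illustrative numbers the bound is about 357 machines, so 2 or 16 machines beat the sequential run while 400 machines do not.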


6.2.1 Remote Execution of Actors 


As a mobile actor computation unfolds, mobile actors
can implicitly and autonomously navigate the underlying
network, causing code to migrate. In each of the
experiments conducted in this chapter, we calculated the
average time to load a Java class file over the network.
On a standard 10 Mbit Ethernet network, the time to load
a remote class file ranges between .15 and .28 seconds
(Table 1); on average it takes .20 seconds. When
distributing an application across several machines, one
must take into account an upper bound on the amount of
parallelism that can be exploited by distributing
processes throughout a network computing system. In
particular, we focus on the overhead associated with
loading Java class files across the network (i.e.
$Total_{loadTime}$). We can calculate the maximum number
of machines $p$ that can share the parallel computation
without compromising speedup by finding the absolute
minimum of the continuous function $T_{cost}(p)$ on the
closed bounded interval $[1, N]$, where
$N = t_{seq} / (t_{load} + k * t_{send})$. Differentiating gives


$T'_{cost}(p) = t_{load} + k * t_{send} - t_{seq} / p^2$


Setting $T'_{cost}(p) = 0$ and solving for $p$ gives


$p = \sqrt{t_{seq} / (t_{load} + k * t_{send})}$


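Therefore, we can estimate the maximum performance
in speedup $S$ as:


$S = t_{seq} / T_{cost}(p)$


$T_{cost}(p) = 2 * (t_{load} + k * t_{send}) * \sqrt{t_{seq} / (t_{load} + k * t_{send})} - (t_{load} + k * t_{send})$


Giving,


$S = \frac{2 * t_{seq} * \sqrt{t_{seq} / (t_{load} + k * t_{send})} + t_{seq}}{4 * t_{seq} - (t_{load} + k * t_{send})}$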
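
As a worked check of these estimates, the sketch below evaluates the optimal machine count and the resulting speedup for the two sequential times reported later in this chapter. It is illustrative only: the function names are ours, and the communication term k * t_send is set to zero, which is our assumption (the text does not state the value of k used for the estimates).

```python
import math


def total_cost(p, t_seq, t_load, t_send=0.0, k=0):
    """T_cost(p) = (p-1)*(t_load + k*t_send) + t_seq/p."""
    return (p - 1) * (t_load + k * t_send) + t_seq / p


def optimal_machines(t_seq, t_load, t_send=0.0, k=0):
    """Minimizer of T_cost: p = sqrt(t_seq / (t_load + k*t_send))."""
    return math.sqrt(t_seq / (t_load + k * t_send))


def max_speedup(t_seq, t_load, t_send=0.0, k=0):
    """S = t_seq / T_cost(p) evaluated at the optimal p."""
    p = optimal_machines(t_seq, t_load, t_send, k)
    return t_seq / total_cost(p, t_seq, t_load, t_send, k)


if __name__ == "__main__":
    # (label, t_seq in secs, t_load in secs); communication cost assumed zero.
    cases = [("TSP, 13 cities", 36655.848, 0.15),
             ("Mersenne prime test", 83432.0, 0.20)]
    for label, t_seq, t_load in cases:
        p = optimal_machines(t_seq, t_load)
        s = max_speedup(t_seq, t_load)
        print(f"{label}: p = {p:.0f}, S = {s:.2f}, utilization = {s / p:.0%}")
```

Under that assumption the sketch gives roughly p = 494 and S = 247.42 for the 13-city TSP, and p = 646 and S = 323 for the Mersenne test, with a utilization S/p of about 50% in both cases.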


6.2.2 Message Passing 


As stated in Chapter 5, communication in JMAS is
asynchronous, reliable and connection-oriented. Messages
between two actors must be routed through a
D-RTM on the local machine on which the two actors
reside. The Java Virtual Machine requires all
communication to go through the Java network layer
(i.e. java.net) and the complete TCP stack of the
underlying OS. This causes a substantial software
overhead compared to the communication libraries of
parallel machines. Using JMAS, a single message can be
sent from one actor to another within .006-.010 seconds
on a standard 10 Mbit Ethernet LAN (Table 1).
As long as applications are coarse grained, the
overhead of opening a socket connection can be ignored.
Since message passing using Java TCP sockets
is slow compared to dedicated parallel machines, and
the communication delays of large networks of
heterogeneous machines are unpredictable, only
computation-intensive parallel applications benefit from
the JMAS infrastructure.


6.3 Traveling Salesman Problem 


Our first application is a parallel solution to the
Traveling Salesman Problem (TSP). The Traveling
Salesman Problem is as follows: given a list of n cities
along with the distances between each pair of cities,
find a tour which starts at the first city, visits each
city exactly once and returns to the first city, such
that the distance traveled is as small as possible. The
decision version of this problem is known to be
NP-complete: no algorithm is known that runs in time
polynomial in n, and it is widely believed that no
polynomial time algorithm exists. In practice, we want
to compute an approximate solution, i.e. a single tour
whose length is as short as possible, in a given amount
of time.


6.4 TSP Algorithm 


We take a naive approach to solving the TSP using an
exhaustive search. The exhaustive-search algorithm
searches all (n - 1)! possible paths, while keeping
the best path searched so far. We generate all possible
paths using a Perm() function on the number of
cities n. The permutation function generates a
lexicographical ordering of all possible paths. We divide
the permutations equally among a set of processors
p, such that each processor searches (n - 1)!/p
possible paths (Figure 12). Processors are arranged in a
master-slave design.


Variable Definitions:

n : Integer to denote the number of cities
p : Integer to denote the number of machines
mintour : Integer to denote the permutation of the best tour searched
start : Integer to denote the starting permutation in lexicographical order
stop : Integer to denote the ending permutation in lexicographical order
resultTour : Integer to denote the best tour searched for a specified
    range lexicographically
itself : Actor address of itself
cust : Actor address to send result
range : Integer to denote the total permutations (tours) to check
Perm(i) : Generates the ith tour in lexicographical order


behavior Slave :

1. recv start, stop, and address of cust to send result
2. mintour = start
3. for i equal start to stop do
       if Perm(i) distance < Perm(mintour) distance
           set mintour to i
4. send mintour to cust


behavior Master :

1. mintour = 0
2. range = (n - 1)!/p
3. for each processor i : 1 to p - 1 do
       create a Remote actor assume behavior Slave, return address
       of actor as x
       send start = (i*range), stop = ((i+1)*range), and the address
       of itself to x
4. become itself and wait for p results
5. for i : 1 to p do
       receive resultTour
       if Perm(resultTour) distance < Perm(mintour) distance
           set mintour to resultTour


Figure 12. TSP Algorithm.
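
The partitioning scheme above can be sketched in a few lines of Python. This is a sequential simulation under our own assumptions, not the JMAS implementation: slaves are plain function calls rather than remote actors, `perm` is one possible realization of Perm(i) using the factorial number system, and the distance matrix is a made-up example.

```python
import math


def perm(i, items):
    """Return the i-th (0-based) permutation of items in lexicographic order."""
    items = list(items)
    result = []
    for pos in range(len(items), 0, -1):
        # Each leading choice covers (pos-1)! permutations of the remainder.
        idx, i = divmod(i, math.factorial(pos - 1))
        result.append(items.pop(idx))
    return result


def tour_length(dist, order):
    """Length of the tour that starts and ends at city 0 and visits order."""
    tour = [0] + list(order) + [0]
    return sum(dist[a][b] for a, b in zip(tour, tour[1:]))


def slave(dist, start, stop):
    """Search permutation indices [start, stop); return the best index."""
    cities = range(1, len(dist))
    best, best_len = start, tour_length(dist, perm(start, cities))
    for i in range(start + 1, stop):
        length = tour_length(dist, perm(i, cities))
        if length < best_len:
            best, best_len = i, length
    return best


def master(dist, p):
    """Split the (n-1)! permutations into at most p ranges and merge results."""
    cities = range(1, len(dist))
    total = math.factorial(len(dist) - 1)
    chunk = -(-total // p)  # ceiling division
    results = [slave(dist, s, min(s + chunk, total))
               for s in range(0, total, chunk)]
    best = min(results, key=lambda i: tour_length(dist, perm(i, cities)))
    return [0] + perm(best, cities) + [0]


# Hypothetical symmetric distance matrix for 4 cities.
DIST = [[0, 1, 4, 2],
        [1, 0, 1, 5],
        [4, 1, 0, 1],
        [2, 5, 1, 0]]

if __name__ == "__main__":
    tour = master(DIST, 3)
    print(tour, tour_length(DIST, tour[1:-1]))
```

Because each slave only needs a start and stop index, the master never ships tours across the network, only two integers per slave and one integer back, which keeps the message count per subcomputation small.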


6.4.1 Measurements 


In order to complete our set of measurements in a
reasonable amount of time, we chose to test our TSP
solution for N = {4, 5, 10, 13} cities. We
conducted the experiment in an environment consisting
of up to 15 machines, and compared the results
with a sequential application running on a SPARC 20
workstation. As shown in Figure 13, there is no
significant gain in performance for N < 10. This is due to
the overhead associated with loading Java class files
across the network. Figure 14 displays the execution
time of a TSP solution for N = 5 versus its remote
Java class loading time. As the number of machines
p increases, the load time increases, causing the
execution time to increase until it exceeds the execution
time of a sequential solution. Notice that we achieve the
best performance for p = 4 machines. For N > 10, our
TSP solution gives better performance. In particular,
for N = 13 the speedup obtained is close to
linear. Due to limited resources, we were unable to
test the scalability of the application for large values
of p. We estimate the performance of our TSP
application using Equations 1, 2 and an average load time
$t_{load}$ = .15 secs. As illustrated in Table 2, the average
CPU utilization for the best possible number of
machines p is 50%. This is because, as the number of
processors p approaches (N - 1)!, the speedup obtained
decreases significantly, due to under-utilization of
processors and the overhead associated with loading
Java class files across the network (Figure 15). The
estimates are also reflected in Figure 13. These results
show that our framework is well suited for coarse-grained
applications. The TSP application also scales well to
large computation sizes (Figure 16).


Application      t_seq (secs)   Max. p   Max. S   Utilization
N = 10 Cities    24.…           13       …        …
N = 13 Cities    36655.848      494      247.42   50%


Table 2. Estimating the Performance of TSP.


[Plot: JMAS Performance of TSP; vertical axis: Speedup]


Figure 13. Speedup of TSP.


Figure 14. Execution Time vs Load Time.


126 5th USENIX Conference on Object-Oriented Technologies and Systems (COOTS '99) 


[Plot: JMAS Performance of TSP]


Figure 16. Scalability of TSP.


6.5 Mersenne Prime Application 


For our second application, we implemented a 
parallel primality test which is used to search for 
Mersenne prime numbers [19]. This type of applica- 
tion is well suited for our infrastructure. It is very 
coarse grained with low communication overhead. 

A Mersenne prime is a prime number of the form
$2^p - 1$, where the exponent $p$ itself is prime. These
are traditionally the largest known primes. Encryption
and decryption methods are typical applications
which utilize large prime numbers. Searching for and
verifying Mersenne primes using computer technology has
been conducted since 1952 [19]. To date 37 Mersenne
primes have been discovered. Only up to the 35th
Mersenne prime has been verified. The current record
holder is $2^{1398269} - 1$ and was discovered through
the use of over 700 PCs and workstations worldwide.
With larger and larger prime exponents, the search for
Mersenne primes becomes progressively more difficult.


6.5.1 Mersenne Prime Algorithm 


In our implementation, each prime is tested based on
the following theorem:

    Lucas-Lehmer Test: For $p$ odd, the
    Mersenne number $2^p - 1$ is prime iff $2^p - 1$
    divides $S(p - 1)$, where $S(n + 1) = S(n)^2 - 2$
    and $S(1) = 4$. The proof can be obtained
    from [19].
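
The theorem translates almost directly into code. The sketch below is a plain Python illustration, not the JMAS Lucas(x) routine; the helper names are ours, and the sequence is reduced modulo $2^p - 1$ at every step so that the numbers stay small.

```python
def lucas_lehmer(p):
    """Return True iff 2**p - 1 is prime, for an odd prime exponent p."""
    m = (1 << p) - 1          # the Mersenne number 2^p - 1
    s = 4                     # S(1) = 4
    for _ in range(p - 2):    # compute S(p - 1) modulo m
        s = (s * s - 2) % m
    return s == 0


def is_prime(n):
    """Trial division; adequate for the small exponents used here."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def mersenne_exponents(start, stop):
    """Odd prime exponents p in [start, stop) for which 2^p - 1 is prime."""
    return [p for p in range(start, stop)
            if p % 2 == 1 and is_prime(p) and lucas_lehmer(p)]


if __name__ == "__main__":
    print(mersenne_exponents(3, 130))
    # prints [3, 5, 7, 13, 17, 19, 31, 61, 89, 107, 127]
```

A slave in the scheme described in this section would call a function like `mersenne_exponents(start, stop)` on its own sub-range and report each hit to the master.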


We develop a mobile actor program to test for
Mersenne primality, given a range of prime numbers
(Figure 17). Processors are arranged in a master-slave
design. As shown below, our application works as
follows:


Given N machines and a range r of prime 
numbers, we divide the search such that each 
machine tests for a Mersenne prime using 
the Lucas-Lehmer Test for a range of primes. 
Each range is of size r/N. 


Variable Definitions:

r : Integer to denote the amount of primes to test
N : Integer to denote the number of machines
Lucas(x) : Performs Lucas-Lehmer test on x
itself : Actor address of itself
cust : Actor address to send result
range : Integer to denote the range of primes to check
start : Integer to denote the starting prime number
stop : Integer to denote the prime number used as a sentinel
recvcount : Integer to denote the total results received
PRIME : enumerator returned from Lucas(x), if x is a prime number
SINK : message to denote the termination of a subcomputation


behavior Slave :

1. recv start, stop, and address of cust to send result
2. for i : start to stop do
       if Lucas(i) is PRIME
           send i to cust
3. send SINK to cust


behavior Master :

1. range = r/N
2. for each processor i : 1 to N - 1 do
       create a Remote actor assume behavior Slave, return address
       of actor as x
       send start = (i*range), stop = ((i+1)*range), and the address
       of itself to x
3. become itself and wait for N results
4. set recvcount = 0
5. receive result
6. if result is SINK
       increment recvcount by 1
   else
       print "2^result - 1 is PRIME!"
7. if recvcount < N, then goto 5


Figure 17. Mersenne Prime Algorithm.


6.5.2 Measurements


For our measurements, we chose to test the Mersenne
primality for all exponents between 4000 and 5000.
Known primes within this range are $2^{4253} - 1$ and
$2^{4423} - 1$. The reason for selecting this range is that:


1. we tried to make the number large enough to sim- 
ulate the true working conditions of the applica- 
tion, 


2. we wanted to keep them small enough to be able 
to complete our set of measurements in a reason- 
able amount of time. 


We conducted the experiment in an environment
consisting of up to 15 machines, and compared the results
with a sequential application running on a SPARC 20
workstation. As shown in Figure 18, our application
scales to 15 machines almost linearly; the speedup
obtained is slightly lower than linear speedup. This is
because we decompose the range of primes to be tested
unevenly in terms of the amount of work to be done.


[Plot: JMAS Performance of Mersenne Prime Application; horizontal axis: Processor p]


Figure 18. Speedup of Mersenne Prime.


For instance, testing if $2^{4000} - 1$ is prime can be done
much faster than testing if $2^{4900} - 1$ is prime. We split
the ranges in groups such that the last machine
receives the last group consisting of the largest numbers.
Due to limited resources, we were unable to test the
scalability of the application for large values of p. We
estimate the performance of the Mersenne Prime
application using Equations 1, 2, where the average load
time $t_{load}$ = .20 secs and the average sequential
execution time $t_{seq}$ = 83432 secs. As shown in Table
3, the application scales up to 646 machines with an
overall speedup of 323. From our results we can assume
that for p > 646, the range of primes to test per machine
decreases, causing under-utilization of CPUs (Figure 19).
Also, for every new machine added, the time to load Java
class files increases, causing a decrease in performance.


Application       t_seq (secs)   Max. p   Max. S   Utilization
Mersenne Prime    83432          646      323      50%


Table 3. Estimating the Performance of the Mersenne Prime Test.


[Plot: JMAS Performance of Mersenne Prime; vertical axis: Utilization]


Figure 19. CPU Utilization of Mersenne Prime.


7 Conclusion 


In this paper we discuss the design and imple- 
mentation of a prototype network computing system 
(JMAS) based on the mobile actor model [10] us- 
ing Java technology [25]. JMAS requires a program- 
ming style different from commonly used approaches 
to distributed computing. JMAS allows a program- 
mer to create mobile actors, initialize their behav- 
iors, and send them messages using constructs pro- 
vided by the JMAS Mobile Actor API. As the com- 
putation unfolds, mobile actors have the ability to 
implicitly navigate autonomously throughout the un- 
derlying network. New messages are generated, new 
actors are created, and existing actors undergo state 
change. We evaluate the performance of our system 
using two benchmarks: a Mersenne Prime Applica- 
tion, and the Traveling Salesman Problem. The de- 
gree of parallelism obtained from distributing mobile 
actors throughout the system is limited due to the 
overhead associated with migrating Java class files, 
and the amount of inter-process communication. In
particular, we are bound by the number of processors


$p = O(\sqrt{t_{seq} / (t_{load} + k * t_{send})})$


to distribute the parallel computation, where $t_{seq}$ is
the sequential execution time of the application, $t_{load}$
is the average time to load the needed Java class files
to one machine, $k$ is the total message rounds sent
per machine, and $t_{send}$ is the average time to send a
communication between two machines. Given $p$, we
can estimate the speedup $S$ as:


$S = t_{seq} / T_{cost}(p)$


where the cost of execution using $p$ machines is
given by the general formula


$T_{cost}(p) = (p - 1) * t_{load} + (p - 1) * k * t_{send} + t_{seq} / p$


Our estimates for the TSP and Mersenne Prime
applications show that each application scales to large
numbers of machines N. For N > p, however, we estimate
a decrease in performance, due to the under-utilization
of CPUs and the significant overhead associated
with loading the needed Java class files and sending
communications throughout the system. These results
show that our framework is well suited for coarse-grained
applications.


7.1 Future Work 


Issues such as fault tolerance and security need 
to be addressed and implemented within the JMAS 
framework. Also, experiments concerning the scal- 
ability of the JMAS framework to support internet 
(global) computing will be conducted in future work. 
Support for high-level communication abstractions 
will be addressed within the JMAS Mobile Actor API. 
Examples are barrier actors, mutex actors, call/return 
communication, and actorSpaces [2].


Abstract


Mobile agents as a new design paradigm for distributed
computing potentially permit network applications
to operate across dynamic and heterogeneous
systems and networks. Agent computing,
however, is subject to inefficiencies. Namely, due 
to the heterogeneous nature of the environments in 
which agents are executed, agent-based programs 
must rely on underlying agent systems to mask some 
of those complexities by using system-wide, uniform 
representations of agent code and data and by ‘hid- 
ing’ the volatility in agents’ ‘spatial’ relationships. 


This paper explores runtime adaptation and agent 
specialization for improving the performance of 
agent-based programs. Our general aim is to enable 
programmers to employ these techniques to improve 
program performance without sacrificing the funda- 
mental advantages promised by mobile agent pro- 
gramming. The specific results in this paper demon- 
strate the beneficial effects of agent adaptation both 
for a single mobile agent and for several cooperat- 
ing agents, using the adaptation techniques of agent 
morphing and agent fusion. Experimental results 
are attained with two sample high performance dis- 
tributed applications, derived from the scientific do- 
main and from sensor-based codes, respectively. 


1 Introduction 


Mobile agents[4, 14, 18, 42] as a new design paradigm
for distributed computing potentially permit network
applications to operate across heterogeneous
systems and dynamic network connectivities, to reduce
their bandwidth needs, and to avoid overheads
caused by large communication latencies. In addition,
mobile agent systems[11, 20, 24, 38] are designed
to facilitate the construction of distributed
programs that have the flexibility to adapt their
operation in response to the heterogeneous nature of
or dynamic changes in underlying distributed computing
platforms.


Agent computing, however, is subject to several in- 
efficiencies. Some of these inefficiencies are caused 
by the complexities of the environments in which 
mobile agents are deployed. Such environmental 
complexities include heterogeneity in architectures, 
communication networks (both at the hardware and 
protocol levels), operating systems, and agent
management systems. This diversity requires
agent-based programs to rely on underlying agent systems,
most of which are based on interpreted languages 
like Java and Tcl/Tk[10, 26], to mask some of these 
complexities, by using system-wide, uniform repre- 
sentations of agent code and states to store, trans- 
port and execute agent programs. Additional inef- 
ficiencies in agent computing are caused by the dy- 
namic nature of agent-based programs, where differ- 
ent components of these programs exhibit volatile 
‘spatial’ relationships. Such ‘spatial’ volatility re- 
sults from agents’ mobility and from the runtime 
service/agent discovery schemes being used. The 
underlying agent systems ‘hide’ this volatility by en- 
suring that remote agent invocations are directed to 
current agent execution sites. 


There has been considerable work on dealing
with inefficiencies in agent computing, including
the development of Just In Time (JIT) compilers
for agent code[23], of methods for creating
efficient Java programs[37], and of performance
tuning techniques and tools for distributed agent
applications[15]. These efforts are particularly
relevant to performance-constrained distributed
applications, such as data mining, where large amounts of
state may have to be moved upon discovery, and
interactive simulations, where applications must offer
real-time performance to end users [12, 21].


Our research is exploring two approaches for im- 
proving the performance of distributed, agent-based 
programs: (1) runtime adaptation and (2) agent 
specialization. The general aim of this work is to en- 
able programmers to employ these techniques to im- 
prove program performance without sacrificing the 
fundamental advantages promised by mobile agent 
programming. This paper explores the effects of us- 
ing two specialization approaches: agent morphing 
on a single mobile agent and agent fusion on multi- 
ple cooperating agents. 


The remainder of this paper is organized as follows:
Section 2 presents two applications that can benefit
from the use of agent technologies while also requiring
levels of performance not easily attained with
current agent systems. Section 2 also describes the
performance implications of using agent- vs. compiled
object-based representations of the software components
involved in interactive data viewing. Based on these
evaluations, Section 3 next describes two agent
specialization techniques, morphing and fusion, that
address some of the runtime performance problems
of applications like these. Significant performance
gains are demonstrated from applying these techniques
to the aforementioned applications for a variety
of typical scenarios of use. Based on the improvements
demonstrated in this section, Section 4
then describes next steps in our research, including
the design of runtime support in which various
adaptation techniques are easily applied. The paper
concludes with a discussion of related research (see
Section 5), conclusions, and future work (Section 6).


2 Using Agents with High Performance
Applications


2.1 Mobile Agents and High Performance 


Advances in the processing and communication ca- 
pabilities of today’s computer systems make it possi- 
ble to wire heterogeneous and physically distributed 
systems into computational grids[8] that are able to 
run computation- and communication-intensive ap- 
plications in real time. Consequently, end users are 
encouraged to interact with their applications while 
they are running, from simply inspecting their current
operation, to ‘steering’ them into appropriate
directions[27]. Examples of such applications include
teleimmersion, interactively steered high performance
computations, data mining, distributed
interactive simulations, and smart sensors and in- 
struments [12, 21, 41, 44]. 


Data in such applications comes from sources like 
sensors, disk archives, network interfaces, and other 
programs, is transformed while passing through the 
computational grid, and is finally output into sinks 
like actuators, storage devices, and the user inter- 
faces employed by interactive end users. Applica- 
tion interfaces also permit applications to be re- 
configured on-line in response to explicit user re- 
quests or to changes in user behavior. Sample re- 
configurations include the creation or termination of 
certain application components, component replica- 
tion, changes in dependencies between components, 
and changes in the mapping of components to com- 
putational grid elements. 


Our aim is to use mobile agents to implement some
of the data processing tasks of interactive high
performance applications. More specifically, while it is
unlikely that a high performance simulation like a
fluid dynamics[39] or finite element code will employ
mobile agents for the simulation itself, it is desirable
to represent as agents many of the computations
and data transformations required for its interactive
use. Such representations enable end users
to interact with their long-running simulations from
diverse locations and machines (e.g., when working
from home), and they permit data transformations to be
placed such that data reductions are performed where
most appropriate (e.g., before sending data to a weakly
connected machine located in an end user’s home). Our
efforts are supported by several recent developments,
including the creation of agent-based visualization and
collaboration tools for high performance computations[13,
40].


Our second aim is to freely mix the use of agent- 
vs. compiled object-based representations of data 
transformation tasks, such that end users need not 
be aware of the current task representations and 
such that changes in task location and represen- 
tation are made in response to current user be- 
havior and needs. This paper presents our design 
ideas and initial implementation concerning a mixed 
agent/object system. This work builds on recent
work elsewhere on object or agent specialization[28]
and on our own work on object technologies for high
performance and interactive parallel or distributed
programs[6, 35, 36].


The remainder of this section describes and evalu- 
ates two applications that are representative of in- 
teractive scientific programs and sensor processing 
applications, respectively. The application drawn 
from the scientific domain, termed Interactive Ac- 
cess to Scientific Data (ISDA), has performance con- 
straints due to the amounts of data being manip- 
ulated and displayed, regardless of end users’ lo- 
cations. The other application’s performance con- 
straints are derived from the necessity to process 
data in real-time or at certain rates, while sensor 
(source) and sink locations may change. This ap- 
plication, termed Parallel Scalable SAR Processing 
Simulator (PSSPS) is derived from the standard 
SAR (Synthetic Aperture Radar) benchmark origi- 
nally developed at Lincoln Labs[46]. 


2.2 Sample Applications: Interactive Scientific
Data Access (ISDA) and Parallel Scalable SAR
Processor Simulator (PSSPS)


Both the ISDA and PSSPS applications are stream- 
based, driven by multiple inputs (stream sources) 
and able to service any number of end points 
(stream sinks). The purpose of ISDA (Figure 1) 
is to enable human end users to view and steer a 
high performance simulation (a global atmospheric 
model acting as a data source) via visualizations 
of the model’s output data (i.e., the stream sinks).
The model simulates the transport of chemical compounds
through the atmosphere. It uses assimilated
windfields derived from satellite observational data
for its transport calculation, and known chemical
concentrations also derived from observational data
as the basis of its chemistry calculations.


Of interest to this paper are the ancillary compu- 
tations that ‘link’ the atmospheric model itself to 
various visualizations, where the set of these ad- 
ditional computations is depicted as a ‘cloud’ con- 
necting the simulation to its inputs/outputs in Fig- 
ure 1. Most such ‘cloud elements’ implement trans- 
formations that prepare model data for viewing by 
end users, e.g., reducing the amount of model data, 
transforming model data from its model-internal to 
a user-viewable representation, etc. Other ‘cloud el- 
ements’ perform additional computations like com- 
paring model outputs with satellite observational 
data. 


The specific cloud elements used in our work are (1) 
a regression model with which statistical tests may 
be performed on selected data, (2) a specialized data 
reduction code that ‘clusters’ scientific data[29] as 
per end user needs, (3) the Spectral-to-Grid trans- 
former(s) that transforms the simulation model’s in- 
ternal data representation to the grid-based repre- 
sentation suitable for data visualization, and (4) the 
Isosurface calculator(s) that computes volumes of 
data of interest to end users and also generates the 
graphical primitives based on which this data may 
be viewed (an example of data of interest to end 
user is data volumes in which certain chemical con- 
stituents have equal levels of concentration, and an 
example of generated graphical primitives are the 
triangular representations of data suitable for the 
OpenGL rendering commands used in the 3D data 
visualization). 


For the ISDA application, the argument for using
agent-based representations for selected ‘cloud’
elements is apparent from the fact that the visualization
engines themselves may have agent-based representations,
as in the case of VizAD[40], in addition
to object-based representations such as the SGI
OpenInventor-based visualizations described in [30].
Specifically, when end users work in laboratories,
they are likely to use high end machines capable of
running the OpenInventor-based visualizations
interactively in order to inspect model data in detail.
When end users are simply ‘looking over a colleague’s
shoulder’ with the collaborative interfaces used
in our work, they require less detailed information
and are likely to use ubiquitously runnable
visualization tools like VizAD that enable such
collaboration across a large diversity of machines and
locations.


Figure 1: Computational components in ISDA.


Consequently, the associated data transformations 
may need to change data representations repeat- 
edly, first from the model’s internal spectral form 
of data to grid form, second from grid form to de- 
scriptions that may be rendered graphically, as ex- 
emplified by an Isosurface calculator. This trans- 
former may reside as an agent on the same machine 
as the agent-based VizAD visualization or it may 
reside on a remote machine and operate as a spe- 
cialized data reduction engine if the VizAD visual- 
ization is run on a weakly connected machine, such 
as a laptop or a computer located in a user’s home. 
Clearly, the suitable choices of representation and 
the locations of agent- or object-based cloud ele- 
ments depend on many factors, including current 
user needs and computing platform characteristics. 
The results presented below represent a first step to- 
ward automating choices like these, as they demon- 
strate the tradeoffs in performance when different 
element representations are used. 


In PSSPS (Figure 2), data is either synthesized on-line
or read from disk files that contain radar images.
The processing of this data is performed by
cloud components that include (1) selectors that filter
out frames of no interest, (2) FIR filters for video
to baseband I/Q conversion, (3) range compression
units for pulse compression, and (4) azimuth compression
units for cross-range convolution filtering.
Convolution results are the output strip-map images
used for visualization. The implementation of
PSSPS used in our work exhibits both pipeline
parallelism similar to that of the ISDA application
and additional parallelism internal to those pipeline
stages that are able to utilize it, as exemplified by the
data parallel processing of the convolution stage,
which speeds up that stage.


In the PSSPS application, agent representations 
are useful for sensors in remote or mobile locations 
and/or for end users who wish to understand sen- 
sor data from mobile or remote locations. One ex- 
ample is a battlefield where radar data should be 
made accessible in some form to mobile units in the 
field operating at locations remote from the radar it- 
self, only when such operational capabilities are cur- 
rently required, whereas more permanent processing 
is installed and operated at regional or global com- 
mand sites. This implies the need for flexibility in 
the location and execution of agent-based SAR com- 
putations associated with data sources and sinks. 


2.3 Tradeoffs across Alternative Program 
Representations 


The basic performance problems arising from the
use of agent vs. compiled object representations of
ISDA or PSSPS components are well understood.
Usage of interpreted languages, such as Java[38, 20,
24], is a major cause of these problems, as depicted
by the experimental results listed in Table 1. These
experiments use the Sun Solaris native C compiler to
generate compiled code and the JDK1.2-beta3 package
for the Java compiler and runtime environment
(including the JIT compiler used in our later
experiments). The platforms used in the experiments are
Sun Ultra-Sparc 30 uniprocessor systems with
Solaris 2.5.1 and a 100 Mbit Fast Ethernet
interconnection.


Figure 2: Structure of PSSPS.


Specifically, these measurements demonstrate that 
for an application component like PSSPS’ Azimuth 
computation, which has a large amount of compu- 
tationally expensive floating point operations, the 
Java code realization runs almost 10 times slower 
than compiled code implemented with C. And simi- 
larly, for an application component like ISDA’s Iso- 
surface generation back-end, the Java code imple- 
mentation on average takes 17 times more time than 
its native counterpart for a random set of grid data. 


Agents             | java   | compiled | ratio
Azimuth processing | 30,992 | 2,048    | 10.515
Isosurface backend | 8,348  | 461      | 18.11


Table 1: Comparing the performance of Java vs.
native code (in msecs except for ‘ratio’).


In comparison with Table 1, the measurements
presented next demonstrate the utility of JIT
compilers for Java, which constitute one way in which
agent-based programs and their runtime environment
may be specialized for efficient execution on
target machines that have such compilers available.
Specifically, Table 2 shows that with JIT, Java
realizations of application components are from more
than 3 times (for Isosurface back-end) to close to
5 times (for Azimuth processing) faster than those
without JIT. This table also shows that static
optimizations done by compilers vary in their
effectiveness and that Java inter-class optimization
does not much affect either of the two application
components.


The performance improvements demonstrated in
Table 2 might be sufficient for some applications.
For applications like PSSPS and ISDA, however,
scalability and utility for large-scale data sets and
for realistic execution rates would be compromised
substantially by the fact that their JIT-based Java
representations are 70% (in the case of Azimuth
processing) to 300% (in the case of Isosurface
back-end) slower than native code. Perhaps
even more important is the fact that significant
additional overheads exist for distributed agent-based
programs in which multiple agents must cooperate
remotely, as is the case for both the ISDA and PSSPS
applications. Agent-based high performance systems
need efficient communication mechanisms to facilitate
cooperation among agents, which may involve
large amounts of data. Unfortunately,
our third experiment shows that Java RMI[43], which
is used for agent communication in many
Java-based agent systems, has overheads that
limit these applications’ scalability in the presence
of intensive agent communication.


Our experiment uses three components of PSSPS,
FIR filtering, Range processing, and Azimuth
processing, to construct three pipelines of length
0 (Azimuth only), 1 (Azimuth and Range), and 2 (all
three components). Our agent implementation uses
Java and RMI, while the compiled object
implementation uses C and OTL. OTL is the object
invocation layer of the COBS CORBA-compliant
object infrastructure developed at GT for high
performance object-based programs[36, 7]. OTL is built
on top of TCP and can perform object invocation
across heterogeneous platforms.


Table 2: Effects of using JIT compilation (in msecs except for ‘ratio’).
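The three pipeline configurations used in this experiment can be summarized by a small Java sketch. It is only an illustration: the class name `PipelineConfig` is hypothetical, and the stage order (FIR filtering feeding Range processing feeding Azimuth processing) is an assumption based on the component list given for PSSPS.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: models only which stages take part in each experiment.
class PipelineConfig {
    // Assumed stage order, upstream to downstream.
    static final List<String> STAGES =
        Arrays.asList("FIR filtering", "Range processing", "Azimuth processing");

    // A pipeline of length n has n inter-stage links, i.e. the last n + 1 stages.
    static List<String> ofLength(int length) {
        if (length < 0 || length >= STAGES.size()) {
            throw new IllegalArgumentException("length must be 0, 1, or 2");
        }
        int first = STAGES.size() - 1 - length;
        return new ArrayList<>(STAGES.subList(first, STAGES.size()));
    }
}
```

Under this reading, each additional unit of pipeline length adds one inter-agent hand-off, which is where the RMI or OTL invocation cost is incurred.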


The experimental results depicted in Table 3
demonstrate the importance of RMI performance even
for computationally intensive applications. With
increased pipeline lengths, the performance
of the compiler-optimized, JIT-enabled Java
representation relative to that of the compiled code
using our efficient distributed object infrastructure
gets progressively worse. We suspect that this
performance problem of Java RMI is due to marshaling
overhead (object creation, stream creation) and to
threading and synchronization costs. There has been
research on improving the performance of Java
RMI[19], but this has not yet resulted in
improvements to the standard Java distribution.


pipeline | java   | native | ratio
0        | 523_   | 2,948  | 1.772
1        | 23,095 | 4,968  | 4.640
2        | 34,404 | 6,054  | 5.685


Table 3: JIT’s effectiveness for pipelined
applications (in msecs except for ‘ratio’).


The measurements depicted above demonstrate the
need for additional optimizations of agent-based
representations of distributed programs if they are to
be used to implement high performance applications.
One basic issue, we believe, is that current
JIT-based optimizations lack information about the
operation and behavior of these distributed
applications that could be exploited to further improve
their performance. First, if the duration
of an agent’s operation is known, then it becomes
feasible to morph an agent-based representation at
runtime into one using native code, invisibly to
end users and using techniques like cross-platform
binary code generation or access to code
repositories. The technique assumed in this paper relies
on the presence of code repositories[2]. Second, if
multiple agents residing and cooperating on one
machine could be ‘compiled’ as if they were one unit,
then this compilation could use global knowledge
not accessible to either method-based JIT compilation
or component-based Javac compile-time optimization,
and could thereby address both intra- and
inter-component (e.g., RMI) performance issues.
Such global properties are considered by the agent
fusion technique explored in this paper, which
combines multiple agents into single, more powerful,
and potentially more efficient agent representations.


Morphing, fusion, and their application to the ISDA 
and PSSPS distributed programs are discussed in 
more detail next. 


3 Techniques for High Performance 
Agent Realization 


3.1 Agent Morphing 


Morphing Concepts. One specialization mech- 
anism proposed in this paper is morphing, which 
means changing the form of a mobile agent to adapt 
to the specific platform on which it is currently run- 
ning. Namely, each agent may have two forms: 
a platform independent form — henceforth termed 
neutral form — and a platform dependent form — 
henceforth termed native form. The agent is
programmed to be able to morph between these two
forms, using the techniques described below.


This paper establishes the importance of agent
morphing and describes the internal structure of
morphable agents. Briefly, we assume that all agents
start with their neutral forms, which implies that
they have no architectural knowledge of the hosts
when they are deployed; this also facilitates the
dynamic introduction of new architectures into the
system. Once started, the agent will spawn a low
priority thread to acquire its native implementation,
and if such a native form exists, the agent can then
switch from its former agent mode into native mode
whenever deemed necessary. Such mode switching
can occur either


• immediately after the agent has acquired its
native form,

• when a morphing instruction in the program
is encountered,

• or in response to end user request or to events
generated by the quality of service management
system[33] that controls agents/objects’
efficient execution.
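The start-up behavior described above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the code used in our experiments: the class `MorphableAgent`, its methods, and the use of a `LongUnaryOperator` as a stand-in for a JNI-backed native form are all hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.LongUnaryOperator;
import java.util.function.Supplier;

// Hypothetical sketch; class and method names are illustrative only.
class MorphableAgent {
    enum Mode { AGENT, NATIVE }

    private volatile Mode mode = Mode.AGENT;
    private volatile LongUnaryOperator nativeForm; // stand-in for a JNI-backed form

    // Start a low-priority thread that tries to acquire the native form.
    // The returned future completes with true if a native form was found.
    CompletableFuture<Boolean> acquireNativeForm(Supplier<LongUnaryOperator> repository) {
        CompletableFuture<Boolean> done = new CompletableFuture<>();
        Thread t = new Thread(() -> {
            try {
                nativeForm = repository.get();   // may yield null if none exists
                done.complete(nativeForm != null);
            } catch (RuntimeException e) {
                done.completeExceptionally(e);
            }
        });
        t.setPriority(Thread.MIN_PRIORITY);
        t.setDaemon(true);
        t.start();
        return done;
    }

    // Switch modes; refused while no native form is available.
    boolean morphToNative() {
        if (nativeForm == null) return false;
        mode = Mode.NATIVE;
        return true;
    }

    Mode mode() { return mode; }

    // Dispatch to whichever form is currently active.
    long process(long x) {
        LongUnaryOperator nf = nativeForm;
        return (mode == Mode.NATIVE && nf != null) ? nf.applyAsLong(x) : x * 2;
    }
}
```

The agent keeps serving requests in agent mode while acquisition runs in the background; any of the three triggers listed above would then call `morphToNative()`.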


During mode switching, the system transforms and 
copies the agent-mode states into its native-mode 
representation. Such transformation and copying 
are platform-dependent, and are carried out by a 
set of functions within the native implementation 
(for our experiments, in a JNI module). This func- 
tion set can either be user- or system-defined. In 
the latter case, the application programmer has to 
define an interface describing the data fields that 
need to be transformed and copied when morphing 
is performed. This interface has to be written in 
both native (in our case C) and agent (Java) code. 
System-provided state transformation functions rely 
on these interface definitions; they also rely on ob- 
ject reflection techniques to achieve transformation 
and copying. 
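The Java side of such a system-provided state transformation might look like the sketch below. It is an assumption-laden illustration: `AzimuthState`, `StateCopier`, and the idea of passing the interface's field names as strings are hypothetical, and the native (C/JNI) half of the transfer is not shown.

```java
import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical agent state; only the fields listed by the caller are transferred.
class AzimuthState {
    int rows = 4;
    double gain = 1.5;
    String scratch = "not transferred";
}

class StateCopier {
    // Read the named fields reflectively into a form-neutral snapshot.
    static Map<String, Object> snapshot(Object source, String... fieldNames) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (String name : fieldNames) {
            try {
                Field f = source.getClass().getDeclaredField(name);
                f.setAccessible(true);
                out.put(name, f.get(source));
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException("cannot read field " + name, e);
            }
        }
        return out;
    }

    // Write a snapshot back into an object representing the other form.
    static void restore(Object target, Map<String, Object> snapshot) {
        for (Map.Entry<String, Object> e : snapshot.entrySet()) {
            try {
                Field f = target.getClass().getDeclaredField(e.getKey());
                f.setAccessible(true);
                f.set(target, e.getValue());
            } catch (ReflectiveOperationException ex) {
                throw new IllegalArgumentException("cannot write field " + e.getKey(), ex);
            }
        }
    }
}
```

Listing the fields explicitly mirrors the role of the interface definition: state that is not named, such as `scratch` here, is simply not carried across the morph.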


An agent may also morph back from its native form 
to its neutral form, which happens when the agent 
decides to migrate. In this case, the agent first 
transforms and copies its native states into neutral 
states, then switches back into agent mode, and
finally migrates, with help from the underlying
agent system[38]. In general, morphing may
be triggered at any point during agent execution 
in response to externally generated events or by the 
agent itself in response to internal state changes. 
However, after morphing, an agent has to restart 
from a fixed entry point, which essentially requires 
an agent to record its (application) state prior to 
morphing. 


The morphable agents designed and used with the 
ISDA and PSSPS applications described in this pa- 
per utilize two key abstractions: (1) invocation adap- 
tors and (2) events. 


The purpose of the adaptor is like that of the
policies associated with objects described in [9, 16, 35]:
it intercepts all incoming invocations to the
morphable agent, ‘translates’ them to the form
appropriate for the agent’s current representation
(neutral or native), and then directs the invocations to
this representation’s implementation. Each agent
uses a native form adaptor and a neutral form
adaptor at the same time, so that invocation clients of
the agent can invoke the agent regardless of their
current states. The system we are constructing
assumes that each agent is morphed in its entirety,
residing either in its native or in its neutral state; this
eliminates problems with partial state translation
and state consistency when state is accessed
simultaneously by native and neutral method
implementations.
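A simplified picture of an invocation adaptor is given in the following Java sketch. It is not the adaptor interface of our system: the class `InvocationAdaptor`, the string request type, and the byte-array encoding used for the native side are assumptions chosen to keep the example self-contained.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.UnaryOperator;

// Hypothetical sketch of an invocation adaptor.
class InvocationAdaptor {
    enum Form { NEUTRAL, NATIVE }

    private final UnaryOperator<String> neutralImpl; // operates on Java objects
    private final UnaryOperator<byte[]> nativeImpl;  // operates on raw bytes, as native code might
    private Form current = Form.NEUTRAL;

    InvocationAdaptor(UnaryOperator<String> neutralImpl, UnaryOperator<byte[]> nativeImpl) {
        this.neutralImpl = neutralImpl;
        this.nativeImpl = nativeImpl;
    }

    // The agent morphs in its entirety, so a single flag covers every method.
    synchronized void setForm(Form form) { current = form; }

    // Intercept the call, translate the argument if needed, and dispatch.
    synchronized String invoke(String request) {
        if (current == Form.NEUTRAL) {
            return neutralImpl.apply(request);
        }
        byte[] reply = nativeImpl.apply(request.getBytes(StandardCharsets.UTF_8));
        return new String(reply, StandardCharsets.UTF_8);
    }
}
```

Callers always use `invoke` and never learn which form answered, which is the property that lets morphing stay invisible to invocation clients.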


The purpose of events is to provide a uniform man- 
ner in which morphing is initiated, in response to 
the receipt of events that are internal or external 
to the agent. Internal events may be raised when 
certain state changes occur; external events may be 
raised by other agents or by a resource management 
system that has global knowledge of the agent pro- 
gram’s behavior. 


Application of Morphing to Sample Applications.
Sample morphable agents have been constructed
with Java, where specialized (morphed) versions
of these agents are also available as native code
for Sun SPARC/Solaris machines. The performance
benefits of agent morphing have already been
presented in Tables 1 and 2, when applied to the
Isosurface and Azimuth transform agents. In these
particular examples, morphing overheads come only
from native library loading and minimal application
state translation and copying and are thus almost
negligible. However, we expect such overheads to be
higher when a code repository server is involved and/or
when the amount of shared data is significant and
morphing is applied more frequently. These agent
realizations utilize the adaptors described in this
section, using internally generated events. We
program adaptors manually in our sample applications,
but we believe that such adaptors can be readily
generated by a compiler from IDL files.


The conditions under which morphing is applied are
straightforward. In each example, when the amount
of data being processed by agents increases (e.g.,
the ISDA application’s visualization wishes to view
more data or the PSSPS application’s image
resolution is increased), the agent implementations
of these transformers do not deliver suitable
performance. This fact is detected by inspection of
internal data buffer fill levels¹. To speed up data
processing, the agent first acquires its native
representation, then invokes an application-provided
state-checkpointing method (if such a method was
defined by the user), then transforms its state
using the function set provided either by the system
or the application programmer, and finally initiates
execution of its native form from a fixed entry point.
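The trigger and the four morphing steps can be sketched as below. This is an illustration only: `BufferedTransformer`, the fill-level threshold, and the recorded step names are hypothetical, and the steps are logged rather than performed.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: threshold, names, and logged steps are illustrative.
class BufferedTransformer {
    private final Deque<int[]> buffer = new ArrayDeque<>();
    private final int capacity;
    private final double morphThreshold; // fraction of capacity, e.g. 0.75
    private boolean nativeMode = false;
    final List<String> steps = new ArrayList<>();

    BufferedTransformer(int capacity, double morphThreshold) {
        this.capacity = capacity;
        this.morphThreshold = morphThreshold;
    }

    // Accept a frame; a high fill level is treated as an internal morphing event.
    void enqueue(int[] frame) {
        buffer.addLast(frame);
        if (!nativeMode && buffer.size() >= morphThreshold * capacity) {
            morph();
        }
    }

    // The order of steps follows the description in the text.
    private void morph() {
        steps.add("acquire native representation");
        steps.add("checkpoint application state");
        steps.add("transform and copy state");
        nativeMode = true;
        steps.add("restart native form at fixed entry point");
    }

    boolean isNative() { return nativeMode; }
}
```

Because the check runs on every enqueue, morphing happens once, at the first moment the producer outpaces the agent-mode consumer.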


Discussion. Advantages of morphing include the 
performance improvements demonstrated in this sec- 
tion and also potential improvements concerning the 
predictability of agent execution. Predictability is
particularly important for real-time and embedded
applications, and Java code execution times
are believed to be difficult to predict due to
interpretation and garbage collection[25]. Difficulties
with morphing arise from two sources. First, if na- 
tive implementations cannot be acquired from a ma- 
chine’s local file system, then morphing overheads 
may become large due to the costs of access to re- 
mote repositories. Second, increased generality and 
complexity of native code compared to the sample 
agents used in our work may make morphing in- 
feasible and/or require the provision of additional 
mechanisms to enable morphing. For example, with 
the sample applications and with the object real- 
izations used in our research, agent safety may be 
guaranteed due to the sample objects’ relative lack 
of internal complexity (e.g., no object-initiated file 
accesses) and due to the object system implementa- 
tion’s lightweight nature and heavy use of libraries. 
For general CORBA- or DCOM-based object im- 
plementations, guaranteeing improved performance 
or predictability as well as safety will require
additional effort. Furthermore, internal native states
like ‘open file descriptors’ cause problems for
morphing that will not be easily addressed by future
agent systems (see [5] for a more detailed discussion
of this topic).


3.2 Agent Fusion 


Fusion Concepts. Fusion is useful for distributed
agents that communicate within and across different
machines. For closely coupled cooperating agents,
communication overheads can constitute significant
portions of their operating costs. For example, in
the sample applications described above, when an
‘upstream’ agent filters out much of the data, then
the remaining data handed off to the ‘downstream’
agent may not result in significant
communication-based overheads. This is the case for the
‘statistics’ agent operating on the atmospheric data in
the ISDA application, for example. Conversely, when
such filtering does not remove much data,
communication overheads may be substantial, as is
evident for many of the PSSPS application’s agents.


¹ More sophisticated techniques for first detecting and
diagnosing performance problems with pipeline-structured
applications are described in [22, 34].


The intent of agent fusion is to remove communi- 
cation overheads from collaborating sets of agents 
and to enable optimizations across agent bound- 
aries. Briefly, these overheads include actual net- 
work transport and protocol processing times, data 
copying costs, and thread/process scheduling and 
context switching costs that arise when tasks are 
performed by multiple vs. single agents residing on 
the same machine. Possible optimizations across 
agent boundaries are similar to those performed 
by compilers across procedure boundaries, includ- 
ing inter-procedural analysis leading to code or data 
motion and procedure integration where multiple 
overlapping procedures are combined into a smaller 
number of more efficient, combined procedures[3, 
31, 32]. One particularly unfortunate communica- 
tion cost is that incurred by multiple cooperating 
agents residing on the same machine, where their 
shared location is due to unforeseen agent migration 
actions (e.g., both agents ‘found’ interesting data 
on the same machine, or both agents moved to that 
machine due to local resource availability). In this 
case, it is clear that such agents would operate much 
more efficiently if they were placed into the same 
address space and used shared user-level threads, as 
this would reduce agent invocation costs to the costs 
of a few procedure calls (via adaptors). It would 
also eliminate overheads associated with the imple- 
mentation of asynchronous invocations, such as the 
use of additional threads and their scheduling and 
synchronization, and additional data copying due to 
asynchrony and kernel/user space crossings. 
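The difference between the two invocation paths can be illustrated with a small Java sketch. It is a simplified model under assumptions of our own: `RangeStage` and `Pipelines` are hypothetical names, and a thread hand-off with a defensive copy stands in for the full cost of an asynchronous remote invocation.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical downstream agent: scales samples in place.
class RangeStage {
    int[] compress(int[] samples) {
        for (int i = 0; i < samples.length; i++) {
            samples[i] *= 2;
        }
        return samples;
    }
}

class Pipelines {
    // Unfused: hand-off to another thread plus a copy of the data.
    static int[] unfused(RangeStage stage, int[] frame) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            int[] copy = frame.clone();
            return CompletableFuture.supplyAsync(() -> stage.compress(copy), worker).join();
        } finally {
            worker.shutdown();
        }
    }

    // Fused: a plain procedure call operating on the shared buffer.
    static int[] fused(RangeStage stage, int[] frame) {
        return stage.compress(frame);
    }
}
```

Both paths compute the same result; the fused path avoids the extra thread, its scheduling and synchronization, and the data copy.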


In summary, while agent morphing specializes indi- 
vidual mobile agents, fusion performs runtime opti- 
mizations across multiple, cooperating agents. Fu- 
sion may be applied repeatedly, to create single ef- 
ficient agents from multiple collaborating agents. 


Application of Fusion to Sample Applications.
Agent fusion is applied to those components in the
sample applications for which co-location on the
same machine and with the same internal form
(native or neutral) is likely to occur. For instance,
Isosurface calculations in the ISDA application may be
performed either outside the scope of the visualization
engine (if the visualization runs on a remote or weakly
connected machine) or jointly with visualization.
In fact, earlier versions of ISDA always performed
Isosurface calculation within the visualization agent
itself, as the visualization was running on a strongly
connected, high end visualization engine (i.e.,
an SGI Octane). End users of the current versions of
ISDA do not wish to use such an instance of
the visualization agent, due to their desire to operate
across a wider spectrum of machines and to service a
larger number of end users.


Figure 3: A scenario for fusion of Isosurface calculation front-end and back-end.


Figure 3 depicts the situation in which fusion is
carried out on the Isosurface front-end and back-end,
where the back-end does the actual isosurface
calculation and then sends the computed isosurfaces
as data to the front-end on which the visualization
agent runs. Originally, there are multiple Isosurface
front-ends, and the back-end is not co-located with any
of the front-ends, with the intent of minimizing overall
communication cost and avoiding burdening the
visualization engines with isosurface calculations. When
all front-ends except one complete their execution,
the back-end might migrate to the remaining
front-end’s location in order to reduce the
communication cost between the two parties. The system is
then able to fuse the two agents to further enhance
performance. Table 4 shows the potential performance
improvements resulting from such a fusion
action. The same table also lists the results of
fusing two agents in the PSSPS application: FIR
filtering and range processing.


In both of the experiments shown below, fusion is
done manually by simulating the actions that would
have been taken by the fusion compiler. Such
actions include procedure in-lining, data sharing, and
asynchrony elimination.


Discussion of Results. The results in Table 4 
indicate the benefits of agent fusion clearly. The 
table’s first three columns show program perfor- 
mance when agents interact via remote object invo- 
cation, where references to other agents are remote 
object references received from a registry service. 
The fourth column shows the performance of the 
same programs in which references to cooperating 
agents are replaced by references to local objects, 
but object fusion has not been performed (i.e., the
compiler did not compile both agents as one unit). 
The next column shows performance subsequent to 
joint agent compilation. 


These experiments demonstrate that it is highly
desirable to put two agents into the same virtual
machine and treat them as local objects to each other
if they are running on the same host and closely
cooperating. In these experiments, the largest gain from
agent fusion is due to the replacement of remote
object invocation. The gains indicated in the last
column are due solely to the compiler’s use of global
optimizations when multiple agents’ code is combined.
They range from 7.88% to 30.13% and are due to
inter-class optimizations like inter-procedural
aliasing and procedure in-lining, and most importantly,
the elimination of asynchrony.


Table 4: Result of fusion applied to different agent forms.
(Columns: Realization; Different Hosts; Same Host; Same
process; Local; Fully Fused; Fusion Benefit. Rows:
Isosurface back & front-end; Fir-Range processing.)


3.3 Summary 


Experiments with sample applications built on top 
of our current infrastructure demonstrate that both 
morphing and fusion have significant performance 
benefits. Morphing provides 70% to 300% gains in 
performance, while fusion provides from 7.88% to 
30.13% additional improvements after co-locating 
two agents in the same process. 


The overheads caused by the acquisition of native 
realizations and by state transformation in morph- 
ing are relatively low, provided that morphing is not 
frequently invoked and that each run of the
application lasts reasonably long. We believe that the
frequency of morphing will be low in most cases, as 
it needs to be invoked only once, unless migration is 
involved. However, migration itself tends to be an 
expensive activity; morphing will simply add some 
costs to this process. Similar arguments hold for 
object fusion. 


4 Toward a System for Mobile Agent 
Optimization 


The performance benefits derived from agent
morphing and fusion presented in Sections 2 and 3
are significant. They are motivating us to construct
an agent system within which agent/object (‘objent’)
programs are easily constructed and adapted at
runtime. Such a system consists of contracts for
performance requirement specifications, a notification
system for contract monitoring, policies for
application-specified adaptation enactment, and finally,
system- or application-defined adaptations.


This paper only addresses adaptation methods and 
adaptation enactment mechanisms unique to mobile 
agent-based applications. In our ongoing research, 
we are identifying other aspects unique to mobile 
agents and therefore, appropriately addressed by an 
adaptive Objent system. We expect to base this 
work on previous and current research in distributed 
adaptive systems[34, 6] and in distributed object 
systems[45]. 


Specifically, our Objent system has to address the 
following issues to support morphing and fusion adap- 
tation: 


1. Where and how are native agent forms created 
and maintained? 


2. How does the system ensure consistency be- 
tween the migratory and native versions of 
agent state? 


3. When an agent’s form is being or has been 
changed, can external agents not aware of this 
change continue invoking it? 


4. What are useful fusion algorithms? 


The basic components of the ‘Objent’
system we are developing have been described in 
a previous publication[2]. We next outline our so- 
lution approaches to the specific questions posed 
above. 


Acquiring an Agent’s Native Version. Agents
are created and migrated in their platform-independent
(neutral) forms. Their native forms
may be created by (1) acquisition of a trusted
native version from the agent’s current execution site,
involving agent retrieval from a local repository
and/or its generation by a locally resident
compiler, or (2) agent acquisition from a remote, trusted
repository to which providers submit agents in forms
suitable for various platforms. Initial design ideas
on such a repository are described in [2].
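The lookup order implied by these two options can be sketched as follows. The sketch is hypothetical throughout: `NativeFormResolver`, the in-memory maps standing in for repositories, and the choice to try the local site before the remote one are assumptions for illustration, and trust checking and local compilation are omitted.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical resolver; repositories are modeled as in-memory maps.
class NativeFormResolver {
    private final Map<String, String> localRepository = new HashMap<>();
    private final Map<String, String> remoteRepository = new HashMap<>();

    private static String key(String agent, String platform) {
        return agent + "@" + platform;
    }

    void publishLocal(String agent, String platform, String location) {
        localRepository.put(key(agent, platform), location);
    }

    void publishRemote(String agent, String platform, String location) {
        remoteRepository.put(key(agent, platform), location);
    }

    // Prefer the current execution site; fall back to the remote repository.
    Optional<String> resolve(String agent, String platform) {
        String k = key(agent, platform);
        String local = localRepository.get(k);
        if (local != null) {
            return Optional.of(local);
        }
        return Optional.ofNullable(remoteRepository.get(k));
    }
}
```

An empty result corresponds to the case in which no native form exists for the platform, so the agent simply continues in its neutral form.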


Consistency Between Multiple Agent Forms. 
An agent capable of morphing never has more than
two implementations, a platform-neutral
and a native one. At any one time, only one of 
these implementations is active. This implies that 
agent morphing necessitates state copying from the 
previously active to the new agent form. We intend 
to develop methods for full state copying, for par- 
tial state copying, and for permitting developers to 
provide specialized state maintenance methods. 


Agent Invocation. When an agent changes its 
form, it is not likely that the agent’s new form is 
able to efficiently interpret the objects passed to it 
via invocations to its old form. Moreover, one of 
the purposes of agent morphing is to enable efficient 
and direct communications between agents in their 
native forms whenever possible. 


The solution being developed in our current research 
is one that permits the (inefficient) invocation of re- 
mote agents that differ in form, coupled with the im- 
plementation of notification protocols among agents 
that enable agents to switch to an efficient invo- 
cation protocol whenever possible. Specifically, we 
employ adaptors that are present in all agents ca- 
pable of morphing. Each adaptor has two forms: 
(1) the neutral form is visible to the agent’s neu- 
tral implementation; (2) the native form is visible 
to the native code. An adaptor is much like an ob- 
ject ‘policy’ in that all invocations in the respective 
agent forms are directed to the appropriate adap- 
tor. The adaptor, then, knows about the agent’s 
current form, has methods generated from its inter- 
face definition for request translation from one form 
to the other, and is able to deal with issues arising 
from the agent’s concurrent invocation in both of its 
forms. 


The overheads of using agent adaptors are small 
when agents communicate in the same form, as was 
the case for the overheads incurred by policies evalu- 
ated in [35]. When adaptors must translate between 
forms, overheads depend on the complexities of in- 
vocation parameters. 
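One way to picture the notification protocol is the Java sketch below. It is a guess at the general shape rather than a description of our protocol: `AgentEndpoint`, the two protocol names, and the rule that matching forms permit direct invocation are assumptions made for the example.

```java
// Hypothetical sketch of protocol selection between two morphable agents.
class AgentEndpoint {
    enum Form { NEUTRAL, NATIVE }
    enum Protocol { DIRECT, TRANSLATING }

    private Form form = Form.NEUTRAL;
    private Form peerForm = Form.NEUTRAL; // last form announced by the peer

    // Change form and notify the peer so it can reconsider its protocol.
    void morphTo(Form newForm, AgentEndpoint peer) {
        form = newForm;
        peer.onPeerFormChanged(newForm);
    }

    void onPeerFormChanged(Form newPeerForm) {
        peerForm = newPeerForm;
    }

    // Matching forms allow the efficient path; otherwise adaptors translate.
    Protocol protocolToPeer() {
        return form == peerForm ? Protocol.DIRECT : Protocol.TRANSLATING;
    }
}
```

While only one side has morphed, both sides fall back to the translating path; once the second agent morphs and announces it, both return to direct invocation.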


Fusion Algorithms. The fusion algorithms used 
in our current work carry out inter-procedural opti- 
mization, and they reduce or eliminate the multi- 
threading overheads caused by asynchronous re- 
mote agent invocation. For example, in PSSPS, 
an asynchronous invocation is implemented as fol- 
lows: with each method in an agent’s interface def- 
inition, we associate a special modifier that de- 
notes whether the method should be invoked syn- 
chronously (SYNC_IF_FUSED) or asynchronously 
(ASYNC_IF_FUSED) by fellow fused agent(s). An
invocation of a SYNC_IF_FUSED method by a fellow
fused agent is replaced by a direct local procedure
call. The fusion algorithm then applies
inter-procedural analysis to perform aliasing and, in the
case of SYNC_IF_FUSED methods, procedure in- 
lining. Aliasing attempts to eliminate unnecessary 
data copying, since data formerly located in differ- 
ent address spaces or on different hosts may poten- 
tially be shared subsequent to agent fusion and co- 
location. 
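In Java, such modifiers could be expressed as method annotations that a fusion tool inspects. The sketch below is an assumption about one possible encoding, not our interface definition language: only the modifier names SYNC_IF_FUSED and ASYNC_IF_FUSED come from the text, while `WhenFused`, `RangeProcessing`, and `FusionPlanner` are hypothetical.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

enum FusedMode { SYNC_IF_FUSED, ASYNC_IF_FUSED }

// Hypothetical annotation carrying the modifier on each interface method.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface WhenFused {
    FusedMode value();
}

// Hypothetical agent interface.
interface RangeProcessing {
    @WhenFused(FusedMode.SYNC_IF_FUSED)
    int compress(int sample);

    @WhenFused(FusedMode.ASYNC_IF_FUSED)
    void report(String message);
}

class FusionPlanner {
    // A method may become a direct call (and be in-lined) only if marked SYNC_IF_FUSED.
    static boolean canInline(Class<?> agentInterface, String methodName) {
        for (Method m : agentInterface.getDeclaredMethods()) {
            if (m.getName().equals(methodName)) {
                WhenFused mode = m.getAnnotation(WhenFused.class);
                return mode != null && mode.value() == FusedMode.SYNC_IF_FUSED;
            }
        }
        throw new IllegalArgumentException("no such method: " + methodName);
    }
}
```

Methods marked ASYNC_IF_FUSED keep their asynchronous semantics after fusion, so the planner leaves them out of in-lining.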


Fusion may be applied repeatedly, possibly followed
later by agent ‘splitting’, if indicated. Agent
‘splitting’ is another agent adaptation method we are
aware of: it applies program slicing to an agent
operating on a distributed data set and distributes the
agent slices so that each slice operates on local data
that is a subset of the distributed data set.
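The outcome of splitting can be illustrated with a reduction over a partitioned data set. The Java sketch below shows only the resulting structure, with one slice per local partition; it does not perform program slicing, and `AgentSlice` and `AgentSplitter` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical slice of an agent that sees only data local to its host.
class AgentSlice {
    private final double[] localData;

    AgentSlice(double[] localData) {
        this.localData = localData;
    }

    double partialSum() {
        double sum = 0;
        for (double v : localData) {
            sum += v;
        }
        return sum;
    }
}

class AgentSplitter {
    // One slice per host-local partition of the distributed data set.
    static List<AgentSlice> split(double[][] partitions) {
        List<AgentSlice> slices = new ArrayList<>();
        for (double[] partition : partitions) {
            slices.add(new AgentSlice(partition));
        }
        return slices;
    }

    // Combine the partial results produced by the slices.
    static double combine(List<AgentSlice> slices) {
        double total = 0;
        for (AgentSlice s : slices) {
            total += s.partialSum();
        }
        return total;
    }
}
```

Only the small partial results cross host boundaries, whereas the unsplit agent would have to pull every partition to one location.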


5 Related Work 


Recent active research on mobile agent systems con- 
cerns the areas of agent facility standardization, mo- 
bile agent system interoperability, and operating sys- 
tem support [17, 20, 24, 38]. The platforms devel- 
oped by such works provide the basic agent sys- 
tem functionality upon which our runtime system 
is built. We add to this functionality the ability to 
adapt mobile agents, and we add the event
mechanisms necessary for building dynamic runtime
support for monitoring and for adaptation initiation
and enactment. 


Research results from software specialization sys- 
tems like SPIN, Exokernel, and Synthetix may be 
applied to our adaptable agent architecture to cus- 
tomize the agent system itself and/or individual 
agents. We will focus on customization issues more 
specific to mobile agent environments.


We have already benefited from research on object
policies and on meta-objects[35, 16] to develop the 
‘adaptor’ concept presented in section 3. Our work 
will also take advantage of current research on qual- 
ity of service infrastructures like BBN’s QuO[45] 
and Honeywell’s ARA[34], but we will adapt their 
techniques to the mobile agent domain targeted by 
our work. 


Our work is part of a broader project described in
[1], with early results presented in [2].


6 Conclusions and Future Work 


Agent computing is subject to several inefficien- 
cies, some of which are due to the complexities of 
the environments in which mobile agents are
deployed. Our research is exploring runtime
adaptation and agent specialization to improve the
performance of agent-based programs. The aim is to
enable programmers to employ these techniques, and
runtime adaptation in general, to improve program
performance without sacrificing the fundamental
advantages promised by mobile agent programming.
We explore the effects of using two specialization 
approaches, morphing and fusion, on a single mo- 
bile agent and on several cooperating agents. Our 
experimental results with two sample applications, 
ISDA and PSSPS, show that such specialization 
approaches result in considerable performance im- 
provement. 


We have built a preliminary infrastructure for
on-line morphing, which offers a mechanism for
inter-language remote invocation using the model of
invocation adaptors developed in our research. These
infrastructures and mechanisms are applied to the ISDA
and PSSPS distributed high performance applica- 
tions. Also used with these applications is a real- 
ization of mobile event channels that allow reliable 
event delivery during end-point migration. 


Our future work concerns systematic support for 
specialization approaches like morphing, fusion and 
others such as slicing. This support will comprise 
event mechanisms and quality of service infrastruc- 
ture, both of which are important to a general agent 
adaptation system. We will also work on compilers 
for agent fusion and the adaptation of agent
invocations.


Abstract


First-generation CORBA middleware was reasonably success- 
ful at meeting the demands of request/response applications 
with best-effort quality of service (QoS) requirements. Sup- 
porting applications with more stringent QoS requirements 
poses new challenges for next-generation real-time CORBA 
middleware, however. This paper provides three contributions 
to the design and optimization of real-time CORBA
middleware. First, we outline the challenges faced by
real-time ORB implementers, focusing on optimization
principle patterns that
can be applied to CORBA’s Object Adapter and ORB Core. 
Second, we describe how TAO, our real-time CORBA imple- 
mentation, addresses these challenges and applies key ORB 
optimization principle patterns. Third, we present the results 
of empirical benchmarks that compare the impact of TAO’s 
design strategies on ORB efficiency, predictability, and scala- 
bility. 

Our findings indicate that ORBs must be highly config- 
urable and adaptable to meet the QoS requirements for a wide 
range of real-time applications. In addition, we show how 
TAO can be configured to perform predictably and scalably, 
which is essential to support real-time applications. A key re- 
sult of our work is to demonstrate that the ability of CORBA 
ORBs to support real-time systems is mostly an implementa- 
tion detail. Thus, relatively few changes are required to the 
standard CORBA reference model and programming API to 
support real-time applications. 


1 Introduction 

Many companies and research groups are developing distributed 
applications using middleware components like CORBA Object 
Request Brokers (ORBs) [1]. CORBA helps to improve the 
flexibility, extensibility, maintainability, and reusability of 
distributed applications [2]. However, a growing class of 
distributed real-time applications also requires ORB middleware 
that provides stringent quality of service (QoS) support, such 
as end-to-end priority preservation, hard upper bounds on 
latency and jitter, and bandwidth guarantees [3]. Figure 1 
depicts the layers and components of an ORB endsystem that 
must be carefully designed and systematically optimized to 
support end-to-end application QoS requirements. 

* Work done by the author while at Washington University. 


[Figure 1: Real-time Features and Optimizations Necessary to 
Meet End-to-end QoS Requirements in ORB Endsystems. Legible 
labels: object, operation, out args + return value, presentation 
layer, data copying & memory management, OS kernel, OS I/O 
subsystem, network interfaces, network adapter.] 


First-generation ORBs lacked many of the features and op- 
timizations [4, 5, 6, 7] shown in Figure 1. This situation was 
not surprising, of course, since the focus at that time was 
largely on developing core infrastructure components, such as 
the ORB and its basic services, defined by the OMG specifications [8]. 
In contrast, second-generation ORBs, such as 
The ACE ORB (TAO) [9], explicitly focus on providing end-to-end 
QoS guarantees to applications by vertically (i.e., network 
interface to application layer) and horizontally (i.e., end-to-end) 
integrating highly optimized CORBA middleware with 
OS I/O subsystems, communication protocols, and network 
interfaces. 

Our previous research has examined many dimensions of 
high-performance and real-time ORB endsystem design, in- 
cluding static [9] and dynamic [10] scheduling, event process- 
ing [11], I/O subsystem integration [12], ORB Core connec- 
tion and concurrency architectures [7], systematic benchmark- 
ing of multiple ORBs [4], and design patterns for ORB ex- 
tensibility [13]. This paper focuses on four more dimensions 
in the high-performance and real-time ORB endsystem design 
space: Object Adapter and ORB Core optimizations for (1) 
request demultiplexing, (2) collocation, (3) memory manage- 
ment, and (4) ORB protocol overhead. 

The optimizations used in TAO are guided by a set of prin- 
ciple patterns [14] that have been applied to optimize mid- 
dleware [15] and lower-level networking software [16], such 
as TCP/IP. Optimization principle patterns document rules 
for avoiding common design and implementation problems 
that degrade the performance, scalability, and predictability of 
complex systems. The optimization principle patterns we ap- 
plied to TAO include: optimizing for the common case; elim- 
inating gratuitous waste; shifting computation in time such 
as precomputing; avoiding unnecessary generality; passing 
hints between layers; not being tied to reference implemen- 
tations; using specialized routines; leveraging system compo- 
nents by exploiting locality; adding state; and using efficient 
data structures. Below, we outline how these optimization 
principle patterns address the following TAO Object Adapter 
and ORB Core design and implementation challenges. 


Optimizing request demultiplexing: The time an ORB’s 
Object Adapter spends demultiplexing requests to target ob- 
ject implementations, i.e., servants, can constitute a signifi- 
cant source of ORB overhead for real-time applications. Sec- 
tion 2 describes how Object Adapter demultiplexing strategies 
impact the scalability and predictability of real-time ORBs. 
This section also illustrates how TAO’s Object Adapter opti- 
mizations enable constant time request demultiplexing in the 
average- and worst-case, regardless of the number of objects 
or operations configured into an ORB. The principle patterns 
that guide our request demultiplexing optimizations include 
precomputing, using specialized routines, passing hints in pro- 
tocol headers, and not being tied to reference models. 


Optimizing collocation: The principle pattern of relaxing 
system requirements enables TAO to minimize the run-time 
overhead for collocated objects, i.e., objects that reside in the 
same address space as their client(s). Operations on collo- 
cated objects are invoked on servants directly in the context 
of the calling thread, thereby transforming operation invocations 
into local virtual method calls. Section 3.1 describes how 
TAO’s collocation optimizations are completely transparent to 
clients, i.e., collocated objects can be used as regular CORBA 
objects, with TAO handling all aspects of collocation. 


Optimizing memory management: ORBs allocate buffers 
to send and receive (de)marshaled data. It is important to opti- 
mize these allocations since they are a significant source of 
dynamic memory management and locking overhead. Sec- 
tion 3.2 describes the mechanisms used in TAO to allocate 
and manipulate the internal buffers it uses for parameter 
(de)marshaling. We illustrate how TAO minimizes fragmenta- 
tion, data copying, and locking for most application use-cases. 
The principle patterns of exploiting locality and optimizing for 
the common case influence these optimizations. 


Minimizing ORB protocol overhead: Real-time systems 
have traditionally been developed using proprietary protocols 
that are hard-coded for each application or application family. 
In theory, the standard CORBA GIOP/IIOP protocols obvi- 
ate the need for proprietary protocols. In practice, however, 
many developers of real-time applications are justifiably con- 
cerned that standard CORBA protocols incur excessive over- 
head. Section 3.3 shows how TAO can be configured to re- 
duce the overhead of GIOP/IIOP without affecting the stan- 
dard CORBA programming APIs exposed to application de- 
velopers. This optimization is based on the principle pattern 
of avoiding unnecessary generality. 


The remainder of this paper is organized as follows: Sec- 
tion 2 outlines the Portable Object Adapter (POA) architecture 
of CORBA ORBs and evaluates the design and performance of 
POA optimizations used in TAO; Section 3 outlines the ORB 
Core architecture of CORBA ORBs and evaluates the design 
and performance of ORB Core optimizations used in TAO; 
Section 4 describes related work; and Section 5 provides con- 
cluding remarks. 


2 Optimizing the POA for Real-time 
Applications 


2.1 POA Overview 


The OMG CORBA 2.2 specification [1] standardizes sev- 
eral components on the server-side of CORBA-compliant 
ORBs. These components include the Portable Object Adapter 
(POA), standard interfaces for object implementations (i.e., 
servants), and refined definitions of skeleton classes for var- 
ious programming languages, such as Java and C++ [2]. 
These standard POA features allow application developers 
to write more flexible and portable CORBA servers [17]. They 
also make it possible to conserve resources by activating objects 
on-demand [18] and to generate “persistent” object references [19] 
that remain valid after the originating server process 
terminates. Server applications can configure these new 
features portably using policies associated with each POA. 

CORBA 2.2 allows server developers to create multiple Ob- 
ject Adapters, each with its own set of policies. Although this 
is a powerful and flexible programming model, it can incur 
significant run-time overhead because it complicates the re- 
quest demultiplexing path within a server ORB. This is partic- 
ularly problematic for real-time applications since naive Ob- 
ject Adapter implementations can increase priority inversion 
and non-determinism [6]. 

Optimizing a POA to support real-time applications requires 
the resolution of several design challenges. This section out- 
lines these challenges and describes the optimization princi- 
ple patterns we applied to maximize the predictability, perfor- 
mance, and scalability of TAO’s POA. These POA optimiza- 
tions include constant-time demultiplexing strategies, reduc- 
ing run-time object key processing overhead during upcalls, 
and generally optimizing POA predictability and reducing 
memory footprint by selectively omitting non-deterministic 
POA features. 


2.2 Optimizing POA Demultiplexing 


Scalable and predictable POA demultiplexing is important for 
many applications such as real-time stock quote systems [20] 
that service a large number of clients, and avionics mission 
systems [11] that have stringent hard real-time timing con- 
straints. Below, we outline the steps involved in demultiplex- 
ing a client request through the server-side of a CORBA ORB 
and then qualitatively and quantitatively evaluate alternative 
demultiplexing strategies. 


2.2.1 Overview of CORBA Request Demultiplexing 


A standard GIOP-compliant client request contains the iden- 
tity of its object and operation. An object is identified by an 
object key, which is an octet sequence. An operation is 
represented as a string. As shown in Figure 2, the ORB 
endsystem must perform the following demultiplexing tasks: 


Steps 1 and 2: The OS protocol stack demultiplexes the incoming 
client request multiple times, starting from the network 
interface, through the data link, network, and transport 
layers up to the user/kernel boundary (e.g., the socket layer), 
where the data is passed to the ORB Core in a server process. 


Steps 3 and 4: The ORB Core uses the addressing information 
in the client’s object key to locate the appropriate POA 
and servant. POAs can be organized hierarchically. Therefore, 
locating the POA that contains the designated servant can 
involve a number of demultiplexing steps through the nested 
POA hierarchy. 


Steps 5 and 6: The POA uses the operation name to find the 
appropriate IDL skeleton, which demarshals the request buffer 
into operation parameters and performs the upcall to code 
supplied by servant developers to implement the object’s 
operation. 


[Figure 2: CORBA 2.2 Logical Server Architecture] 


The conventional deeply-layered ORB endsystem demulti- 
plexing implementation shown in Figure 2 is generally inap- 
propriate for high-performance and real-time applications for 
the following reasons [21]: 


Decreased efficiency: Layered demultiplexing reduces per- 
formance by increasing the number of internal tables that 
must be searched as incoming client requests ascend through 
the processing layers in an ORB endsystem. Demultiplexing 
client requests through all these layers is expensive, particu- 
larly when a large number of operations appear in an IDL in- 
terface and/or a large number of servants are managed by an 
Object Adapter. 


Increased priority inversion and non-determinism: Lay- 
ered demultiplexing can cause priority inversions because 
servant-level quality of service (QoS) information is inacces- 
sible to the lowest-level device drivers and protocol stacks in 
the I/O subsystem of an ORB endsystem. Therefore, an Ob- 
ject Adapter may demultiplex packets according to their FIFO 
order of arrival. FIFO demultiplexing can cause higher priority 
packets to wait for a non-deterministic period of time while 
lower priority packets are demultiplexed and dispatched [12]. 


Conventional implementations of CORBA incur significant 
demultiplexing overhead. For instance, [4, 6] show that con- 
ventional ORBs spend ~17% of the total server time process- 
ing demultiplexing requests. Unless this overhead is reduced 
and demultiplexing is performed predictably, ORBs cannot 
provide uniform, scalable QoS guarantees to real-time appli- 
cations. 

The remainder of this section focuses on demultiplexing op- 
timizations performed at the ORB layer, i.e., steps 3 through 6. 
Information on OS kernel layer demultiplexing optimizations 
for real-time ORB endsystems is available in [22, 12]. 


2.2.2 Overview of Alternative Demultiplexing Strategies 


As illustrated in Figure 2, demultiplexing a request to a ser- 
vant and dispatching the designated servant operation involves 
several steps. Below, we qualitatively outline the most com- 
mon demultiplexing strategies used in CORBA ORBs. Sec- 
tion 2.2.3 then quantitatively evaluates the strategies that are 
appropriate for each layer in the ORB. 


Linear search: This strategy searches through a table se- 
quentially. If the number of elements in the table is small, 
or the application has no stringent QoS requirements, linear 
search may be an acceptable demultiplexing strategy. For real- 
time applications, however, linear search is undesirable since 
it does not scale up efficiently or predictably to a large num- 
ber of servants or operations. In this paper, we evaluate linear 
search only to provide an upper-bound on worst-case perfor- 
mance, though some ORBs [4] use linear search for operation 
demultiplexing. 


Binary search: Binary search is a more scalable demultiplexing 
strategy than linear search since its O(lg n) lookup 
time is effectively constant for most applications. However, 
insertions and deletions can be complicated since data must 
be sorted for the binary search algorithm to work correctly. 
Therefore, binary search is particularly useful for ORB opera- 
tion demultiplexing since all insertions and sorting can be per- 
formed off-line by an IDL compiler. In contrast, using binary 
search to demultiplex requests to servants is more problem- 
atic since servants can be inserted or removed dynamically at 
run-time. 
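
The off-line sorting described above can be illustrated with a minimal C++17 sketch. It is not TAO's generated code: the operation names, the `OperationEntry` type, and `find_operation` are hypothetical, and the table is sorted by hand where an IDL compiler would emit it already sorted.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <string_view>

// Hypothetical skeleton entry: an operation name plus an index that
// stands in for the skeleton function an IDL compiler would emit.
struct OperationEntry {
    std::string_view name;
    int skeleton_index;
};

// The table is sorted by name ahead of time, so no run-time sorting
// or insertion is ever needed for operations.
constexpr std::array<OperationEntry, 4> kOperations{{
    {"deposit", 0},
    {"get_balance", 1},
    {"transfer", 2},
    {"withdraw", 3},
}};

// O(lg n) lookup; returns -1 when the operation name is unknown.
inline int find_operation(std::string_view op) {
    auto it = std::lower_bound(
        kOperations.begin(), kOperations.end(), op,
        [](const OperationEntry& e, std::string_view key) {
            return e.name < key;
        });
    if (it != kOperations.end() && it->name == op) {
        return it->skeleton_index;
    }
    return -1;
}
```

The same approach is awkward for servants, because every activation or deactivation would have to keep the table sorted.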


Dynamic hashing: Many ORBs use dynamic hashing as 
their Object Adapter demultiplexing strategy. Dynamic hash- 
ing provides O(1) performance for the average case and sup- 
ports dynamic insertions more readily than binary search. 
However, due to the potential for collisions, its worst-case ex- 
ecution time is O(n), which makes it inappropriate for hard 
real-time applications that require efficient and predictable 
worst-case ORB behavior. Moreover, depending on the hash 
algorithm, dynamic hashing often has a fairly high constant 
overhead [6]. 


Perfect hashing: If the set of operations or servants is 
known a priori, dynamic hashing can be improved by precomputing 
a collision-free perfect hash function [23]. Perfect 
hashing is based on the principle patterns of precomputing and 
using specialized routines. A demultiplexing strategy based 
on perfect hashing executes in constant time and space. This 
property makes perfect hashing well-suited for deterministic 
real-time systems that can be configured statically [6], i.e., the 
number of objects and operations can be determined off-line. 
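
To make the idea of precomputing a collision-free function concrete, here is a small C++17 sketch. It is an illustration under stated assumptions, not the algorithm used by any particular generator: the operation names, the table size `kSlots`, and the seed search are all hypothetical choices. The precomputation step tries seeds until every known name lands in its own slot; lookups afterwards cost one hash and one string comparison.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string_view>

// Hypothetical operation names, fixed and known before run time.
constexpr std::array<std::string_view, 4> kOps{
    "deposit", "get_balance", "transfer", "withdraw"};
constexpr std::size_t kSlots = 16;

// Seeded FNV-1a style hash; high bits are folded in before the modulo.
constexpr std::size_t slot_of(std::string_view s, std::uint32_t seed) {
    std::uint32_t h = 2166136261u ^ seed;
    for (char c : s) {
        h ^= static_cast<unsigned char>(c);
        h *= 16777619u;
    }
    return (h ^ (h >> 16)) % kSlots;
}

struct PerfectTable {
    std::uint32_t seed = 0;
    std::array<int, kSlots> slot{};  // index into kOps, or -1 if empty
};

// Precomputation: search for a seed that maps every name to its own slot.
inline PerfectTable build_table() {
    for (std::uint32_t seed = 0; seed < 100000u; ++seed) {
        PerfectTable t;
        t.seed = seed;
        t.slot.fill(-1);
        bool collision = false;
        for (std::size_t i = 0; i < kOps.size(); ++i) {
            int& cell = t.slot[slot_of(kOps[i], seed)];
            if (cell != -1) { collision = true; break; }
            cell = static_cast<int>(i);
        }
        if (!collision) return t;
    }
    throw std::runtime_error("no collision-free seed found");
}

// Lookup: one hash and one comparison, independent of the number of names.
inline int lookup(const PerfectTable& t, std::string_view op) {
    const int i = t.slot[slot_of(op, t.seed)];
    return (i >= 0 && kOps[static_cast<std::size_t>(i)] == op) ? i : -1;
}
```

In a statically configured system the seed search would run off-line and only the resulting table and `lookup` would ship in the ORB.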


Active demultiplexing: Although the number and names of 
operations can be known a priori by an IDL compiler, the 
number and names of servants are generally more dynamic. 
In such cases, it is possible to use the object ID and POA ID 
stored in an object key to index directly into a table managed 
by an Object Adapter. Active demultiplexing uses the principle 
patterns of relaxing system requirements, not being tied to 
reference models, and passing hints in headers. This so-called 
active demultiplexing [6] strategy provides a low-overhead, 
O(1) lookup technique that can be used throughout an Object 
Adapter. 
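
A minimal C++ sketch of direct indexing follows. It assumes the Object Adapter generates the key itself; the `ActiveKey` layout and the generation counter (used here to reject stale keys after a slot is reused) are illustrative assumptions rather than details taken from TAO.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Servant { int id; };  // stand-in for a real servant

// Hypothetical hint carried in the object key: a table index plus a
// generation counter that guards against reuse of a freed slot.
struct ActiveKey {
    std::uint32_t index;
    std::uint32_t generation;
};

class ActiveObjectTable {
public:
    // Stores the servant and returns the key to embed in its object key.
    ActiveKey activate(Servant* s) {
        std::uint32_t idx;
        if (!free_.empty()) {
            idx = free_.back();
            free_.pop_back();
        } else {
            idx = static_cast<std::uint32_t>(slots_.size());
            slots_.emplace_back();
        }
        slots_[idx].servant = s;
        return ActiveKey{idx, slots_[idx].generation};
    }

    void deactivate(ActiveKey k) {
        if (find(k) == nullptr) return;
        slots_[k.index].servant = nullptr;
        ++slots_[k.index].generation;  // invalidates any stale keys
        free_.push_back(k.index);
    }

    // O(1) in the worst case: one bounds check, one array access.
    Servant* find(ActiveKey k) const {
        if (k.index >= slots_.size()) return nullptr;
        const Slot& s = slots_[k.index];
        return s.generation == k.generation ? s.servant : nullptr;
    }

private:
    struct Slot {
        Servant* servant = nullptr;
        std::uint32_t generation = 0;
    };
    std::vector<Slot> slots_;
    std::vector<std::uint32_t> free_;
};
```

No hashing or string comparison happens on the lookup path, which is why the cost does not depend on the number of servants.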


Table 1 summarizes the demultiplexing strategies considered 
in the implementation of TAO’s POA. 


Strategy        | Search Time                        | Comments 
Linear Search   | O(n)                               | Simple to implement; does not scale 
Binary Search   | O(lg n)                            | Additions/deletions are expensive 
Dynamic Hashing | O(1) average case, O(n) worst case | Hashing overhead 
Perfect Hashing | O(1) worst case                    | For static configurations, generate collision-free hashing functions 
Active Demuxing | O(1) worst case                    | For system generated keys, add direct indexing information to keys 

Table 1: Summary of Alternate POA Demultiplexing Strategies 


2.2.3 The Performance of Alternative POA Demultiplexing 
Strategies 


Section 2.2.1 describes the demultiplexing steps a CORBA re- 
quest goes through before it is dispatched to a user-supplied 
servant method. These demultiplexing steps include finding 
the Object Adapter, the servant, and the skeleton code. This 
section empirically evaluates the strategies that TAO uses for 
each demultiplexing step. All POA demultiplexing measurements 
were conducted on an UltraSPARC-II with two 300 
MHz CPUs and 512 Mbytes of RAM, running SunOS 5.5.1 and 
C++ Workshop Compilers version 4.2. 


POA lookup: An ORB Core must locate the POA corre- 
sponding to an incoming client request. Figure 2 shows that 
POAs can be nested arbitrarily. Although nesting provides a 
useful way to organize policies and namespaces hierarchically, 
the POA’s nesting semantics complicate demultiplexing com- 
pared with the original CORBA Basic Object Adapter (BOA) 
demultiplexing [6] specification. 

We conducted an experiment to measure the effect of increasing 
the POA nesting level on the time required to look up 
the appropriate POA in which the servant is registered. We 
used a range of POA depths, 1 through 25. The results are 
shown in Figure 3. 


[Figure 3: Effect of POA Depth on POA Demultiplexing Latency. 
Axes: POA depth (x), latency in us (y).] 


Since most ORB server applications do not have deeply 
nested POA hierarchies, TAO currently uses a POA demulti- 
plexing strategy where each POA finds its child using dynamic 
hashing and delegates to the child POA where this process is 
repeated until the search is complete. This POA demultiplex- 
ing strategy results in O(n) growth for the lookup time and 
does not scale up to deeply nested POAs. Therefore, we are 
adding active demultiplexing to the POA lookup phase, which 
operates as follows: 


1. All lookups start at the Root POA. 


2. The Root POA will maintain a POA table that points 
to all the POAs in the hierarchy. 


3. Object keys will include an index into the POA table 
to identify the POA where the object was activated. 
TAO’s ORB Core will use this index as the active demul- 
tiplexing key. 

4. In some cases, the POA name also may be needed, e.g., 
if the POA is activated on-demand. Therefore, the object 
reference will contain both the name and the index. 
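
The four steps above can be sketched in C++ as follows. This is only an illustration of the planned design as described here: `RootPoa`, `PoaHint`, and the name-based fallback (a plain map lookup standing in for whatever on-demand activation would do) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

struct Poa { std::string name; };  // stand-in for a real POA

// Hypothetical object-key fields used for POA lookup (steps 3 and 4).
struct PoaHint {
    std::size_t index;
    std::string name;
};

class RootPoa {
public:
    // Step 2: the Root POA keeps one flat table of every POA.
    // Returns the index that would be embedded in object keys.
    std::size_t register_poa(Poa* poa) {
        table_.push_back(poa);
        by_name_[poa->name] = table_.size() - 1;
        return table_.size() - 1;
    }

    // Step 3: index directly into the table. Step 4: if the index does
    // not identify the named POA, fall back to a lookup by name.
    Poa* find(const PoaHint& hint) const {
        if (hint.index < table_.size() &&
            table_[hint.index] != nullptr &&
            table_[hint.index]->name == hint.name) {
            return table_[hint.index];
        }
        auto it = by_name_.find(hint.name);
        return it == by_name_.end() ? nullptr : table_[it->second];
    }

private:
    std::vector<Poa*> table_;
    std::unordered_map<std::string, std::size_t> by_name_;
};
```

With a valid index the lookup cost no longer depends on how deeply the POA is nested, unlike the child-by-child delegation measured in Figure 3.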


Using active demultiplexing for POA lookup should provide 
optimal predictability and scalability, just as it does when used 
for servant demultiplexing, which is described next. 


Servant demultiplexing: Once the ORB Core demulti- 
plexes a client request to the right POA, this POA demulti- 
plexes the request to the correct servant. The following discus- 
sion compares the various servant demultiplexing techniques 
described in Section 2.2.2. TAO uses the Service Configu- 
rator [24], Bridge, and Strategy design patterns [25] to defer 
the configuration of the desired servant demultiplexing strat- 
egy until ORB initialization, which can be performed either 
statically (i.e., at compile-time) or dynamically (i.e., at run- 
time) [13]. Figure 4 illustrates the class hierarchy of strategies 
that can be configured into TAO’s POAs. 


[Figure 4: TAO’s Class Hierarchy for POA Active Object Map 
Strategies] 


To evaluate the scalability of TAO, our experiments used a 
range of servants, 1 to 500 by increments of 100, in the server. 
Figure 5 shows the latency for servant demultiplexing as the 
number of servants increases. This figure illustrates that ac- 
tive demultiplexing is a highly predictable, low-latency servant 
lookup strategy. In contrast, dynamic hashing incurs higher 
constant overhead to compute the hash function. Moreover, 
its performance degrades gradually as the number of servants 
increases and the number of collisions in the hash table 
increases. Likewise, linear search does not scale for any realistic 
system, i.e., its performance degrades rapidly as the number of 
servants increases. 


[Figure 5: Servant Demultiplexing Latency with Alternative 
Search Techniques. Axes: number of objects (x), latency in us (y).] 


Note that we did not implement the perfect hashing strategy 
for servant demultiplexing. Although it is possible to know the 
set of servants on each POA for certain statically configured 
applications a priori, creating perfect hash functions repeat- 
edly during application development is tedious. We omitted 
binary search for similar reasons, i.e., it requires maintaining 
a sorted active object map every time an object is activated 
or deactivated. Moreover, since the object key is created by 
a POA, active demultiplexing provides equivalent, or better, 
performance than perfect hashing or binary search. 


Operation demultiplexing: The final step at the Object 
Adapter layer involves demultiplexing a request to the appro- 
priate skeleton, which demarshals the request and dispatches 
the designated operation upcall in the servant. To measure 
operation demultiplexing overhead, our experiments defined 
a range of operations, 1 through 50, in the IDL interface. 

For ORBs like TAO that target real-time embedded systems, 
operation demultiplexing must be efficient, scalable, and pre- 
dictable. Therefore, we generate efficient operation lookup 
using GPERF [23], which is a freely available perfect hash 
function generator we developed. 

GPERF [26] automatically constructs perfect hash func- 
tions from a user-supplied list of keywords. In addition to the 
perfect hash functions, GPERF can also generate linear and 
binary search strategies. 

Figure 6 illustrates the interaction between the TAO IDL 
compiler and GPERF. When perfect hashing, linear search and 
binary search operation demultiplexing strategies are selected, 
TAO’s IDL compiler invokes GPERF as a co-process to gen- 
erate an optimized lookup strategy for operation names in IDL 
interfaces. 





[Figure 6: Integrating TAO’s IDL Compiler and GPERF] 


The lookup key for this phase is the operation name, which 
is a string defined by developers in an IDL file. However, 
it is not permissible to modify the operation string name 
to include active demultiplexing information. Since active 
demultiplexing cannot be used without modifying the GIOP 
protocol,¹ TAO uses perfect hashing for operation demultiplexing. 
Perfect hashing is well-suited for this purpose since all 
operation names are known at compile time. 

Figure 7 plots operation demultiplexing latency as a function 
of the number of operations. This figure illustrates that 
perfect hashing is extremely predictable and efficient, outperforming 
dynamic hashing and binary search. As expected, linear 
search depends on the number and ordering of operations, 
which complicates worst-case schedulability analysis for real-time 
applications. 


[Figure 7: Operation Demultiplexing Latency with Alternative 
Search Techniques] 


¹ We are investigating modifications to the GIOP protocol for hard 
real-time systems that possess stringent latency and message-footprint 
requirements. 


Optimizing servant-based lookups: When a CORBA request 
is dispatched by the POA to the servant, the POA uses 
the Object Id in the request header to find the servant in its Ac- 
tive Object Map. Section 2.2.3 describes how TAO’s lookup 
strategies provide efficient, predictable, and scalable mecha- 
nisms to dispatch requests to servants based on Object Ids. In 
particular, TAO’s Active Demultiplexing strategy enables con- 
stant O(1) lookup in the average- and worst-case, regardless 
of the number of servants in a POA’s Active Object Map. 

However, certain POA operations and policies require 
lookups on the Active Object Map to be based on the servant 
pointer rather than the Object Id. For instance, 
the _this method on the servant can be used with the 
IMPLICIT_ACTIVATION POA policy outside the context of 
request invocation. This operation allows a servant to be ac- 
tivated implicitly if the servant is not already active. If the 
servant is already active, it will return the object reference cor- 
responding to the servant. 

Unfortunately, naive POA Active Object Map implementations 
incur worst-case performance for servant-based 
lookups. Since the primary key is the Object Id, servant-based 
lookups degenerate into a linear search, even when Active 
Demultiplexing is used for the Object Id-based lookups. As 
shown in Figure 5, linear search is prohibitively expensive as 
the number of servants in the Active Object Map increases. 
This overhead is particularly problematic for real-time appli- 
cations, such as avionics mission computing systems [11], that 
(1) create a large number of objects using _this during their 
initialization phase and (2) must reinitialize rapidly to recover 
from transient power failures. 

To alleviate servant-based lookup bottlenecks, we apply the 
principle pattern of adding extra state to the POA in the form 
of a Reverse-Lookup map that associates each servant with its 
Object Id in O(1) average-case time. In TAO, this Reverse- 
Lookup map is used in conjunction with the Active Demulti- 
plexing map that associates each Object Id to its servant. Fig- 
ure 8 shows the time required to find a servant, with and with- 
out the Reverse-Lookup map, as the number of servants in a 
POA increases. 

Servants are allocated from arbitrary memory locations. 
Since we have no control over the pointer value format, TAO 
uses a hash map for the Reverse-Lookup map. The value of the 
servant pointer is used as the hash key. Although hash maps 
do not guarantee O(1) worst-case behavior, they do provide a 
significant average-case performance improvement over linear 
search. 

A Reverse-Lookup map can be used only with the 
UNIQUE_ID POA policy since with the MULTIPLE_ID POA 
policy, a servant may support many Object Ids. This constraint 
is not a shortcoming since servant-based lookups are only required 
with the UNIQUE_ID policy. One downside of adding 
a Reverse-Lookup map to the POA, however, is the increased 
overhead of maintaining an additional table in the POA. For 
every object activation and deactivation, two updates are required 
in the Active Object Map: (1) to the Reverse-Lookup 
map and (2) to the Active Demultiplexing map used for 
Object Id-based lookups. However, this additional processing 
does not affect the critical path of Object Id-based lookups 
during run-time. 


[Figure 8: Benefits of Adding a Reverse-Lookup Map to the 
POA. Axis: number of servants (x).] 
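
The two-table arrangement can be sketched in C++17 as below. The class and member names are hypothetical and the Object Id is simplified to a table index; the sketch only shows the idea of hashing the servant pointer under UNIQUE_ID semantics, not TAO's actual Active Object Map.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

struct Servant {};               // stand-in for a real servant
using ObjectId = std::uint32_t;  // hypothetical: an index into forward_

class ActiveObjectMap {
public:
    // UNIQUE_ID semantics: a servant that is already active keeps its id.
    ObjectId activate(Servant* s) {
        auto it = reverse_.find(s);
        if (it != reverse_.end()) return it->second;
        const ObjectId id = static_cast<ObjectId>(forward_.size());
        forward_.push_back(s);    // update 1: Object Id -> servant
        reverse_.emplace(s, id);  // update 2: servant -> Object Id
        return id;
    }

    void deactivate(ObjectId id) {
        if (id >= forward_.size() || forward_[id] == nullptr) return;
        reverse_.erase(forward_[id]);
        forward_[id] = nullptr;
    }

    // Object Id-based lookup: direct indexing, unaffected by reverse_.
    Servant* servant_of(ObjectId id) const {
        return id < forward_.size() ? forward_[id] : nullptr;
    }

    // Servant-based lookup: O(1) on average via the pointer hash.
    std::optional<ObjectId> id_of(const Servant* s) const {
        auto it = reverse_.find(s);
        if (it == reverse_.end()) return std::nullopt;
        return it->second;
    }

private:
    std::vector<Servant*> forward_;
    std::unordered_map<const Servant*, ObjectId> reverse_;
};
```

Without `reverse_`, `id_of` would have to scan `forward_` linearly, which is the degenerate case described above.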


Summary of TAO’s POA demultiplexing strategies: 
Based on the results of our benchmarks described above, 
Figure 9 summarizes the demultiplexing strategies that we 
have determined to be most appropriate for real-time applications [11]. 
Figure 9 shows the use of active demultiplexing 
for the POA names, active demultiplexing for the servants, 
and perfect hashing for the operation names. Our previous 
experience [27, 4, 28, 6, 7] measuring the performance of 
CORBA implementations showed TAO is more efficient and 
predictable than widely used conventional CORBA ORBs. 


[Figure 9: TAO’s Default Demultiplexing Strategies] 


All of TAO’s optimized demultiplexing strategies described 
above are entirely compliant with the CORBA specification. 
Thus, no changes are required to the standard POA interfaces 
specified in the CORBA specification [1]. 


2.3 Optimizing Object Key Processing in POA 
Upcalls 

Motivation: Since the POA is in the critical path of request 
processing in a server ORB, it is important to optimize its processing. 
Figure 10 shows a naive way to parse an object key. 
In this approach, the object key is parsed and the individual 
fields of the key are stored in separate components. Unfortunately, 
this approach (1) allocates memory dynamically for 
each individual object key field and (2) copies data to move 
the object key fields into individual objects. 


[Figure 10: Naive Parsing of Object Keys. Example object key: 
P353becdb00094ae8/firstPOA/myservant; labeled fields include 
Time Stamp and Object Id.] 


TAO’s object key upcall optimizations: TAO provides the 
following object key optimizations based on the principle pat- 
terns of avoiding obvious waste and avoiding unnecessary 
generality. TAO leverages the fact that the object key is avail- 
able through the entire upcall and is not modified. Thus, 
the individual components in the object key can be optimized 
to point directly to their correct locations, as shown in Fig- 
ure 11. This eliminates wasteful memory allocations and data 
copies. This optimization is entirely compliant with the stan- 
dard CORBA specification. 
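
The copy-free approach can be illustrated with C++17 `std::string_view`, which records a pointer and a length into the original buffer. The sketch assumes the '/'-separated layout of the example key in Figure 10 with a single POA level; the field names and `parse_object_key` are hypothetical, and the views are valid only while the key buffer is alive, which matches the observation that the key is available throughout the upcall.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Each field is a view into the original key buffer: no allocation
// and no copying of key bytes.
struct ParsedObjectKey {
    std::string_view timestamp;
    std::string_view poa_name;
    std::string_view object_id;
};

// Assumes the layout "timestamp/POA name/object id" with one POA level.
inline std::optional<ParsedObjectKey> parse_object_key(std::string_view key) {
    const std::size_t first = key.find('/');
    if (first == std::string_view::npos) return std::nullopt;
    const std::size_t second = key.find('/', first + 1);
    if (second == std::string_view::npos) return std::nullopt;
    return ParsedObjectKey{
        key.substr(0, first),
        key.substr(first + 1, second - first - 1),
        key.substr(second + 1)};
}
```

The naive variant would build three separate strings instead, incurring the dynamic allocations and copies that this optimization removes.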


[Figure 11: TAO’s Optimized Parsing of Object Keys] 


2.4 Optimizing POA Predictability and Minimizing Footprint 


Motivation: To adequately support real-time applications, 
an ORB’s Object Adapter must be predictable and minimal. 
For instance, it must omit non-deterministic operations to improve 
end-to-end predictability. Likewise, it must provide a 
minimal memory footprint to support embedded systems [15]. 


TAO’s predictability optimizations: Based on the princi- 
ple patterns of avoiding unnecessary generality and relaxing 
system requirements, we enhanced TAO’s POA to selectively 
disable the following features in order to improve end-to-end 
predictability of request processing: 


• Servant Managers are not required: There is no need 
to locate servants in a real-time environment since all servants 
must be registered with POAs a priori. 


• Adapter Activators are not required: Real-time applications 
create all their POAs at the beginning of execution. 
Therefore, they need not use or provide an adapter activator. 
The alternative is to create POAs during request processing, in 
which case end-to-end predictability is hard to achieve. 


• POA Managers are not required: The POA must not 
introduce extra levels of queueing in the ORB. Queueing can 
cause priority inversion and excessive locking. Therefore, the 
POA Manager in TAO can be disabled. 


TAO’s footprint optimizations: In addition to increasing 
the predictability of POA request processing, omitting these 
features also decreases TAO’s memory footprint. These omis- 
sions were done in accordance with the Minimum CORBA 
specification [29], which removes the following features from 
the CORBA 2.2 specification [1]: 


• Dynamic Skeleton Interface
• Dynamic Invocation Interface
• Dynamic Any
• Interceptors
• Interface Repository
• Advanced POA features
• CORBA/COM interworking

Component           | CORBA   | Minimum CORBA | Percentage Reduction
[illegible]         | 381,886 | [illegible]   | 26.5
ORB Core            | 347,080 | [illegible]   | 4.8
Dynamic Any         | 31,305  | [illegible]   | [illegible]
CDR Interpreter     | 68,687  | [illegible]   | [illegible]
IDL Compiler        | 10,488  | 10,312        | [illegible]
Pluggable Protocols | 14,610  | 14,674        | [illegible]
Default Resources   | 7,919   | 7,975         | [illegible]
Total               | 861,985 | 639,456       | 25.8

Table 2: Comparison of CORBA with Minimum CORBA Memory Footprint





Table 2 shows the footprint reduction achieved when the 
features listed above are excluded from TAO. The 25.8% re- 
duction in memory footprint for Minimum CORBA is fairly 
significant. However, we plan to reduce the footprint of TAO 
even further by streamlining its CDR Interpreter [15]. In Min- 
imum CORBA, TAO’s CDR Interpreter only needs to support 
the static skeleton interface (SSI) and static invocation inter- 
face (SII). Thus, support for the dynamic skeleton interface 
(DSI) and dynamic invocation interface (DII) can be omitted.
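
As a quick check on the overall figure, the 25.8% reduction follows from the two totals reported in the last row of Table 2 (861,985 for CORBA and 639,456 for Minimum CORBA):

```latex
\[
\frac{861{,}985 - 639{,}456}{861{,}985}
  = \frac{222{,}529}{861{,}985}
  \approx 0.258 = 25.8\%
\]
```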


3 Optimizing the ORB Core for Real- 
time Applications 


The ORB Core is a standard component in CORBA that is re- 
sponsible for connection and memory management, data trans- 
fer, endpoint demultiplexing, and concurrency control [1]. 
An ORB Core is typically implemented as a run-time library 
linked into both client and server applications. When a client 
invokes an operation on an object, the ORB Core is responsi- 
ble for delivering the request to the object and returning a re- 
sponse, if any, to the client. For objects executing remotely, a 
CORBA-compliant ORB Core transfers requests via the Gen- 
eral Inter-ORB Protocol (GIOP), which is commonly imple- 
mented with the Internet Inter-ORB Protocol (IIOP) that runs 
atop TCP. 

Optimizing a CORBA ORB Core to support real-time ap- 
plications requires the resolution of many design challenges. 
This section outlines several of these challenges and describes 
the optimization principle patterns we applied to maximize 
the predictability, performance, and scalability of TAO’s ORB 
Core. These optimizations include transparently collocating 
clients and servants that are in the same address space, mini- 
mizing dynamic memory allocations and data copies, and min- 
imizing GIOP/IIOP protocol overhead. Additional optimiza- 
tions for real-time ORB Core connection management and 
concurrency strategies are described in [30]. 
3.1 Collocation Optimizations


Motivation: In addition to separating interface and imple- 
mentation, a key strength of CORBA is its decoupling of (1) 
servant implementations from (2) how servants are configured 
into server processes throughout a distributed system. In prac- 
tice, CORBA is used primarily to communicate between re- 
mote objects. However, there are configurations where a client 
and servant must be collocated in the same address space [31]. 
In this case, there is no need to incur the overhead of data mar- 
shaling or transmitting requests and replies through a “loop- 
back” transport device, which is an application of the principle 
pattern of avoiding obvious waste. 


TAO’s collocation optimization technique: TAO’s POA 
optimizes for collocated client/servant configurations by gen- 
erating a special stub for the client, which is an application 
of the principle pattern of relaxing system requirements. This 
stub forwards all requests to the servant and eliminates data 
marshaling, which is an application of the principle pattern 
of avoiding waste. Figure 12 shows the classes produced by 
TAO’s IDL compiler. 

Figure 12: TAO’s POA Mapping and Collocation Class


The stub and skeleton classes shown in Figure 12 are required
by the POA specification; the collocation class is specific to
TAO. Collocation is transparent to the client since it only
accesses the abstract interface and never uses the collocation
class directly. Therefore, the POA provides the collocation
class, rather than the regular stub class, when the servant
resides in the same address space as the client.
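
The relationship between the abstract interface, the regular stub, and the collocation class can be sketched in C++ as follows. This is a minimal illustration under stated assumptions, not the code that TAO's IDL compiler generates: the names Cubit, CubitSkeleton, CubitImpl, CubitStub, CubitCollocated, and make_reference are hypothetical, and the remote path is deliberately not modeled.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>

// Abstract interface seen by the client (hypothetical name).
class Cubit {
public:
    virtual ~Cubit() = default;
    virtual long cube(long x) = 0;
};

// Skeleton base class that servants derive from (hypothetical name).
class CubitSkeleton {
public:
    virtual ~CubitSkeleton() = default;
    virtual long cube(long x) = 0;
};

// Application-supplied servant.
class CubitImpl : public CubitSkeleton {
public:
    long cube(long x) override { return x * x * x; }
};

// Regular stub: a real one would marshal the request and send it via the ORB.
// That path is not modeled here, so the call only reports the fact.
class CubitStub : public Cubit {
public:
    long cube(long) override {
        throw std::runtime_error("remote path is not modeled in this sketch");
    }
};

// Collocation class: forwards to the servant through a virtual call,
// so no marshaling or loopback transport is involved.
class CubitCollocated : public Cubit {
public:
    explicit CubitCollocated(CubitSkeleton& servant) : servant_(servant) {}
    long cube(long x) override { return servant_.cube(x); }
private:
    CubitSkeleton& servant_;
};

// Choose the stub type once, when the object reference is created.
std::unique_ptr<Cubit> make_reference(CubitSkeleton* local_servant) {
    if (local_servant != nullptr)
        return std::make_unique<CubitCollocated>(*local_servant);
    return std::make_unique<CubitStub>();
}
```

Because the client holds only a pointer to Cubit, the same client code runs unchanged whichever stub make_reference returns.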


Supporting transparent collocation in TAO: Clients can
obtain an object reference in several ways, e.g., from
a CORBA Naming Service or from a Lifecycle Service
generic factory operation. Likewise, clients can use
string_to_object to convert a stringified interoperable
object reference (IOR) into an object reference. To ensure
locality transparency, an ORB’s collocation optimization must
determine if an object is collocated. If it is, the ORB returns a
collocated stub; if it is not, the ORB returns a regular stub to
a distributed object.

The specific steps used by TAO’s collocation optimizations 
are described below: 


Step 1 - Determining collocation: To determine if an
object reference is collocated, TAO’s ORB Core maintains a
collocation table, which applies the principle of maintaining
extra state. Figure 13 shows the internal structure for
collocation table management in TAO. Each collocation table maps
an ORB’s transport endpoints to its RootPOA. In the case of
IIOP, endpoints are specified using {hostname, port number}
tuples.


[Figure 13 (class diagram): CORBA::ORB, RootPOA, PortableServer::POA, Table Collection, Collocation Table, and Table Entry (endpoint : Addr, poa : PortableServer::POA)]

Figure 13: Class Relationship of TAO’s Collocation Tables

Multiple ORBs can reside in a single server process. Each 
ORB can support multiple transport protocols and accept re- 
quests from multiple transport endpoints. Therefore, TAO 
maintains multiple collocation tables for all transport proto- 
cols used by ORBs within a single process. Since different 
protocols have different addressing methods, maintaining pro- 
tocol specific collocation tables allows us to strategize and op- 
timize the lookup mechanism for each protocol. 
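
A protocol-specific collocation table of this kind can be sketched as a map from endpoints to RootPOAs. The fragment below is only an illustration for the IIOP case: RootPOA, IiopEndpoint, and IiopCollocationTable are hypothetical names, and an ordered std::map stands in for whatever lookup structure TAO actually uses.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <tuple>

// Stand-in for PortableServer::POA; only its identity matters here.
struct RootPOA { std::string orb_name; };

// IIOP endpoint: a {hostname, port number} tuple, as described in the text.
struct IiopEndpoint {
    std::string host;
    unsigned short port;
    bool operator<(const IiopEndpoint& o) const {
        return std::tie(host, port) < std::tie(o.host, o.port);
    }
};

// One table per transport protocol; this one covers IIOP only.
class IiopCollocationTable {
public:
    // Register an endpoint on which a local ORB accepts requests.
    void bind(const IiopEndpoint& ep, RootPOA* poa) { table_[ep] = poa; }

    // Returns the RootPOA for a collocated endpoint, or nullptr otherwise.
    RootPOA* find(const IiopEndpoint& ep) const {
        auto it = table_.find(ep);
        return it == table_.end() ? nullptr : it->second;
    }

private:
    std::map<IiopEndpoint, RootPOA*> table_;
};
```

Keeping a separate table type per protocol lets each one choose a key and comparison suited to that protocol's addressing scheme.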


Step 2 - Obtaining a reference to a collocated object: A
client acquires an object reference either by resolving an
imported IOR using string_to_object or by demarshaling
an incoming object reference. In either case, TAO examines
the corresponding collocation tables according to the profiles
carried by the object to determine if the object is collocated
or not. If the object is collocated, TAO performs the series of
steps shown in Figure 14 to obtain a reference to the collocated
object.

Figure 14: Finding a Collocated Object in TAO


As shown in Figure 14, when a client process tries to resolve
an imported object reference (1), the ORB checks (2) the
collocation table maintained by TAO’s ORB Core to determine if
any object endpoints are collocated. If a collocated endpoint is
found, this check succeeds and the RootPOA corresponding to
the endpoint is returned. Next, the matching Object Adapter
is queried for the servant, starting at its RootPOA (3). The
ORB then instantiates a generic CORBA::Object (4) and
invokes the _narrow operation on it. If a servant is found, the
ORB’s _narrow operation (5) invokes the servant’s _narrow
method (6) and a collocated stub is instantiated and returned to
the client (7). Finally, clients invoke operations (8) on the
collocated stub, which forwards the operation to the local servant
via a virtual method call.

If the imported object reference is not collocated, then either
operation (2) or (3) will fail. In this case, the ORB invokes the
_is_a method to verify that the remote object matches the
target type. If the test succeeds, a distributed stub is created and
returned to the client. All subsequent operations are invoked
remotely. Thus, the process of selecting collocated stubs or
non-collocated stubs is completely transparent to clients and
is performed only at the time of object reference creation.


Step 3 - Performing collocated object invocations: Col- 
located operation invocations in TAO borrow the client’s 
thread-of-control to execute the servant’s operation. There- 
fore, they are executed within the client thread at its thread 
priority. 

Although executing an operation in the client’s thread is
very efficient, it is undesirable for certain types of real-time
applications [32]. For instance, priority inversion can occur
when a client in a lower priority thread invokes operations
on a collocated object in a higher priority thread. To provide
greater access control over the scope of TAO’s collocation
optimizations, applications can associate different access
policies with endpoints so they only appear collocated to certain
priority groups. Since endpoints and priority groups in
many real-time applications are statically configured, this
access control lookup does not impose additional overhead.


Empirical results: To measure the performance gain from
TAO’s collocation optimizations, we ran server and client
threads in the same process. Two platforms were used to
benchmark the test program: a dual 300 MHz UltraSparc-II
running SunOS 5.5.1 and a dual 400 MHz Pentium-II running
Microsoft Windows NT 4.0 (SP3). The test program was run
both with TAO’s collocation optimizations enabled and
disabled to compare the performance systematically.

Figure 15 shows the performance improvement, measured
in calls-per-second, using TAO’s collocation optimizations.
Each operation cubed a variable-length sequence of longs
that contained 4 and 1,024 elements, respectively. As expected,
collocation greatly improves the performance of operation
invocations when servants are collocated with clients.
Our results show, depending on the size of arguments passed
to the operations, performance improves from 2,000% to
200,000%. Although the test results are foreseeable, they
show that by using TAO’s collocation optimization, invocations
on collocated CORBA objects can be as fast as calling
functions on local C++ objects.


[Figure 15 (bar chart): calls/sec for the operations cube_small_sequence<long> and cube_large_sequence<long>; series: Solaris w/o Collocation, Solaris w/ Collocation, NT w/o Collocation, NT w/ Collocation]

Figure 15: Results of TAO’s Collocation Optimizations


TAO’s collocation optimizations are not totally compliant
with the CORBA standard since its collocation class forwards
all requests directly to the servant class. Although this makes
the common case very efficient, this implementation does not
support the following advanced POA features:


• POA::Current is not set up
• Interceptors are bypassed
• POA Manager state is ignored
• Servant Managers are not consulted
• Etherealized servants can cause problems
• Location forwarding is not supported
• The POA’s Thread_Policy is circumvented


Adding support for these features to TAO’s collocation class
slows down the collocation optimization, which is why TAO
currently omits these features. We plan to support these
advanced features in future releases of TAO so that if applications
know these advanced features are not required they can
be ignored selectively.


3.2 Memory Management Optimizations 


Motivation: A key source of overhead and non-determinism 
in conventional ORB Core implementations is improper man- 
agement of memory buffers. Memory buffers are used by 
CORBA clients to send requests containing marshaled param- 
eters. Likewise, CORBA servers use memory buffers to re- 
ceive requests containing marshaled parameters. 

One source of memory management overhead stems from 
the use of dynamic memory allocation, which is problem- 
atic for real-time ORBs. For instance, dynamic memory can 
fragment the global process heap, which decreases ORB pre- 
dictability. Likewise, locks used to access a global heap from 
multiple threads can increase synchronization overhead and 
incur priority inversion [30]. 

Another significant source of memory management overhead
involves excessive data copying. For instance, conventional
ORBs often resize their internal marshaling buffers
multiple times when encoding large operation parameters.
Naive memory management implementations use a single
buffer that is resized automatically as necessary, which can
cause excessive data copying.


TAO’s memory management optimization techniques:
TAO’s memory management optimizations leverage the
design of its concurrency strategies, which minimize thread
context switching overhead and priority inversions by eliminating
queueing within the ORB’s critical path. For example,
on the client-side, the thread that invokes a remote operation
is the same thread that completes the I/O required to send the
request, i.e., no queueing exists within the ORB. Likewise,
on the server-side, the thread that reads a request completes
the upcall to user code, also eliminating queueing within the
ORB. These optimizations are based on the principle patterns
of exploiting locality and optimizing for the common case.

By avoiding thread context switches and queueing, TAO 
can benefit from memory management optimizations based 
on thread-specific storage. Thread-specific storage is a com- 
mon design pattern [13] for optimizing buffer management 
in multi-threaded middleware. This pattern allows multiple 
threads to use one logically global access point to retrieve 
thread-specific data without incurring locking overhead for 
each access, which is an application of the pattern of avoiding 
waste. TAO uses this pattern to place its memory allocators 
into thread-specific storage. Using a thread-specific memory 
pool eliminates the need for intra-thread allocator locks, re- 
duces fragmentation in the allocator, and helps to minimize 
priority inversion in real-time applications. 
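
The idea of placing an allocator in thread-specific storage can be sketched with C++11 thread_local. This is a simplified illustration, not TAO's allocator: BufferPool, tss_pool, and the 512-byte buffer size are hypothetical, and a production pool would normally preallocate its buffers rather than fall back to the global heap on a cache miss.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fixed-size buffer pool; a hypothetical stand-in for an ORB buffer allocator.
class BufferPool {
public:
    static constexpr std::size_t kBufferSize = 512;

    // Reuse a cached buffer if one exists; otherwise allocate a new one.
    char* acquire() {
        if (!free_.empty()) {
            char* buf = free_.back();
            free_.pop_back();
            return buf;
        }
        ++heap_allocations_;
        return new char[kBufferSize];
    }

    // Return a buffer to this pool's cache instead of the global heap.
    void release(char* buf) { free_.push_back(buf); }

    // Number of times acquire() had to go to the global heap.
    std::size_t heap_allocations() const { return heap_allocations_; }

    ~BufferPool() {
        for (char* buf : free_) delete[] buf;
    }

private:
    std::vector<char*> free_;
    std::size_t heap_allocations_ = 0;
};

// One logically global access point; each thread that calls it gets its
// own pool, so acquire() and release() need no lock.
BufferPool& tss_pool() {
    thread_local BufferPool pool;
    return pool;
}
```

A buffer released by a thread is handed back to the same thread on its next acquire(), so the steady-state path touches neither a lock nor the global heap.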

In addition, TAO minimizes unnecessary data copying by
keeping a linked list of CDR buffers. As shown in Figure 16,
operation arguments are marshaled into TSS allocated buffers.
The buffers are linked together to minimize data copying.
Gather-write I/O system calls, such as writev, can then write
these buffers atomically without requiring multiple OS calls,
unnecessary data allocation, or copying. TAO’s memory management
design also supports special allocators, such as zero-copy
schemes [33] that share memory pools between user processes,
the OS kernel, and network interfaces.


[Figure 16 (diagram): the arguments of operation (param1, param2, large_param) are marshaled into a chain of ORB buffers, which a single gather writev() call sends]

Figure 16: TAO’s Internal Memory Management
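
The gather-write step can be sketched with the POSIX writev call. The fragment below is an illustration only: send_chain is a hypothetical helper, std::string stands in for a CDR buffer, and the sketch assumes the chain is shorter than the system's IOV_MAX limit and that the descriptor accepts the whole write in one call.

```cpp
#include <sys/uio.h>   // writev, struct iovec (POSIX)
#include <unistd.h>    // pipe, read, close
#include <cassert>
#include <string>
#include <vector>

// Send a chain of separately allocated buffers with one gather-write call,
// without first copying them into a single contiguous buffer.
// Returns the number of bytes written, or -1 on error (as writev does).
ssize_t send_chain(int fd, const std::vector<std::string>& chain) {
    std::vector<iovec> iov;
    iov.reserve(chain.size());
    for (const std::string& buf : chain) {
        iovec v;
        // writev only reads from iov_base; the cast satisfies the C API.
        v.iov_base = const_cast<char*>(buf.data());
        v.iov_len = buf.size();
        iov.push_back(v);
    }
    return ::writev(fd, iov.data(), static_cast<int>(iov.size()));
}
```

Only the small array of iovec descriptors is built per request; the marshaled data itself stays where it was written.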


Empirical results: Figure 17 compares buffer allocation
time for a CORBA request using thread-specific storage (TSS)
allocators with that of using a global allocator. These
experiments were executed on a Pentium II/450 with 256 MB
of RAM, running LynxOS 3.0. The test program contained
a group of ORB buffer (de)allocations intermingled with a
pseudo-random sequence of regular (de)allocations. This is
typical of middleware frameworks like CORBA, where application
code is called from the framework and vice-versa. Both
experiments perform the same sequence of memory allocation
requests, with one experiment using a TSS allocator for the
ORB buffers and the other using a global allocator.


Figure 17: Buffer Allocation Time using TSS and Global Allocators

In this experiment, we perform ~16 ORB buffer allocations 
and ~1,000 regular data allocations. The exact series of al- 
locations is not important, as long as both experiments per- 
form the same number. If there is one series of allocations 
where the global allocator behaves non-deterministically, it is 
not suitable for hard real-time systems. 

Our results in Figure 17 illustrate that TAO’s TSS allocators 
isolate the ORB from variations in global memory allocation 
strategies. In addition, this experiment shows how TSS allo- 
cators are more efficient than global memory allocators since 
they eliminate locking overhead. In general, reducing locking 
overhead throughout an ORB is important to support real-time 
applications with deterministic QoS requirements [30]. 


3.3 Minimizing ORB Protocol Message Foot- 
print 


Motivation: Real-time systems have traditionally been de- 
veloped using proprietary protocols that are hard-coded for 
each application. In theory, CORBA’s GIOP/IIOP protocols 
obviate the need for proprietary protocols. In practice, how- 
ever, many developers of real-time applications are justifiably 
concerned that standard CORBA protocols will cause exces- 
sive overhead. For example, some applications have very strict 
constraints on latency, which is affected by the total time re- 
quired to transmit the message. Other applications, such as 
mobile PDAs running over wireless access networks, have 
limited bandwidth, which makes them more sensitive to pro- 
tocol message footprint overhead. 


TAO’s ORB protocol optimization techniques: A GIOP
request includes a number of fields, such as the version number,
that are required for interoperability among ORBs. However,
certain fields are not required in all application domains.
For instance, the magic number and version fields can be omitted
if a single supplier and single version is used for ORBs in
a real-time embedded system. Likewise, if the communicating
ORBs are running on systems with the same endianness, i.e.,
big-endian or little-endian, the byte order flag can be omitted
from the request.

Since embedded and real-time systems typically run the
same ORB implementation on similar hardware, we have
modified TAO to optionally remove some fields from the
GIOP header and the GIOP Request header when the
-ORBgioplite option is given to the client and server
CORBA::ORB_init method. The fields removed by this
optimization are shown in Table 3. These optimizations are
guided by the principle patterns of relaxing system requirements
and avoiding unnecessary generality.
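
To make the per-message savings concrete, the fragment below tallies the header bytes involved. GiopHeader follows the fixed 12-byte message header of GIOP 1.0/1.1; LiteHeader is purely hypothetical and only assumes that the message type and message size must still be sent. It is not a description of the actual -ORBgioplite wire format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Fixed GIOP message header as laid out by the GIOP 1.0/1.1 specification.
struct GiopHeader {
    char          magic[4];       // "GIOP"
    std::uint8_t  version_major;
    std::uint8_t  version_minor;
    std::uint8_t  flags;          // byte order (a boolean in 1.0, flag bits in 1.1)
    std::uint8_t  message_type;
    std::uint32_t message_size;
};

// Hypothetical reduced header: keeps only what a receiver still needs once
// magic number, version, and byte order are agreed on out of band.
struct LiteHeader {
    std::uint8_t  message_type;
    std::uint32_t message_size;
};

// Wire sizes are summed field by field, since sizeof() may include padding.
constexpr std::size_t giop_wire_size() { return 4 + 1 + 1 + 1 + 1 + 4; }
constexpr std::size_t lite_wire_size() { return 1 + 4; }

// Bytes saved in the message header: magic (4) + version (2) + flags (1).
constexpr std::size_t header_bytes_saved() {
    return giop_wire_size() - lite_wire_size();
}

// An empty CDR sequence still carries a 4-byte length, so dropping the
// service context and the principal saves at least 4 bytes each.
constexpr std::size_t min_request_bytes_saved() { return 4 + 4; }
```

Under these assumptions, the minimum saving per request is 7 + 8 = 15 bytes, with more saved whenever the service context or principal would have been non-empty.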


Header Field            | Size
GIOP magic number       | 4 bytes
GIOP version            | 2 bytes
GIOP flags (byte order) | 1 byte
Request Service Context | ≥ 4 bytes
Request Principal       | ≥ 4 bytes
Total                   | ≥ 15 bytes

Table 3: Messaging Footprint Savings for TAO’s GIOPlite Optimization


Empirical results: We conducted an experiment to measure
the performance impact of omitting the GIOP fields in Table 3.
These experiments were executed on a Pentium II/450 with
256 MB of RAM, running LynxOS 3.0 in loopback mode.
Table 4 summarizes the results, expressed in calls-per-second:


         | Marshaling Enabled                      | Marshaling Disabled
         | min         | max         | avg         | min         | max         | avg
GIOP     | 2,878       | 2,937       | 2,906       | 2,912       | 2,976       | 2,949
GIOPlite | 2,883       | 2,978       | [illegible] | [illegible] | [illegible] | 2,967

Table 4: Performance of TAO’s GIOP and GIOPlite Protocol Implementations


Our empirical results reveal a slight, but measurable, 2% 
improvement when removing the GIOP message footprint 
“overhead.” More importantly though, these changes do not 
affect the standard CORBA APIs used to develop applications. 
Therefore, programmers can focus on the development of ap- 
plications, and if necessary, TAO can be optimized to use this 
lightweight version of GIOP. 


To obtain more significant protocol optimizations, we are 
adding a pluggable protocols framework to TAO [34]. This 
framework generalizes TAO’s current -ORBgioplite op- 
tion to support both pluggable ORB protocols (ESIOPs) and 
pluggable transport protocols. 


4 Related Work 


Demultiplexing is an operation that routes messages through 
the layers of an ORB endsystem. Most protocol stack models,
such as the Internet model or the ISO/OSI reference model, 
require some form of multiplexing to support interoperabil- 
ity with existing operating systems and peer protocol stacks. 
Likewise, conventional CORBA ORBs utilize several extra 
levels of demultiplexing at the application layer to associate 
incoming client requests with the appropriate servant and op- 
eration (as shown in Figure 2). 

Related work on demultiplexing focuses largely on the 
lower layers of the protocol stack, i.e., the transport layer 
and below, as opposed to the CORBA middleware. For in- 
stance, [21, 35, 22, 36] study demultiplexing issues in com- 
munication systems and show how layered demultiplexing is 
not suitable for applications that require real-time quality of 
service guarantees. 

Packet filters are a mechanism for efficiently demultiplex- 
ing incoming packets to application endpoints [37]. A number 
of schemes to implement fast and efficient packet filters are 
available. These include the BSD Packet Filter (BPF) [38], 
the Mach Packet Filter (MPF) [39], PathFinder [40], demul- 
tiplexing based on automatic parsing [41], and the Dynamic 
Packet Filter (DPF) [36]. 

As mentioned before, most existing demultiplexing strategies
are implemented within the OS kernel. However, optimally
reducing ORB endsystem demultiplexing overhead
requires a vertically integrated architecture that extends from the
OS kernel to the application servants. Since our ORB is currently
implemented in user-space, our work focuses
on minimizing the demultiplexing overhead in steps 3, 4, 5,
and 6 (which are shaded in Figure 2).


5 Concluding Remarks 


Developers of real-time systems are increasingly using
off-the-shelf middleware components to lower software lifecycle
costs and decrease time-to-market. In this economic climate,
the flexibility offered by CORBA makes it an attractive
middleware architecture. Since CORBA is not tightly coupled to
a particular OS or programming language, it can be adapted
readily to “niche” markets, such as real-time embedded systems,
which are not well covered by other middleware. In this
sense, CORBA has an advantage over other middleware, such
as DCOM [42] or Java RMI [43], since it can be integrated
into a wider range of platforms and languages.


The POA and ORB Core optimizations and performance re- 
sults presented in this paper support our contention that the 
next-generation of standard CORBA ORBs will be well-suited 
for distributed real-time systems that require efficient, scal- 
able, and predictable performance. Table 5 summarizes which 
TAO optimizations are associated with which principle pat- 
terns, as well as which optimizations conform to the CORBA 
standard and which are non-standard. 





Optimization           | Principle Patterns                          | Compliant
Request demuxing       | Precompute; Avoid waste;                    | [illegible]
                       | Passing hints in header;                    |
                       | Relaxing system requirements;               |
                       | Using specialized routines;                 |
                       | Not tied to reference models;               |
                       | Adding extra state                          |
Object keys            | Avoid waste                                 | yes
Predictability         | Relaxing system requirements                | yes
Collocation            | Relax system requirements; Avoid waste      | [illegible]
Memory management      | Avoid waste; Optimize for common case;      | yes
                       | Add extra state; Exploit locality           |
Protocol msg footprint | Avoid generality; Relax system requirements | [illegible]

Table 5: Degree of CORBA-compliance for Real-time Optimization Principle Patterns



Our primary focus on the TAO project has been to research, 
develop, and optimize policies and mechanisms that allow 
CORBA to support hard real-time systems, such as avion- 
ics mission computing [11]. In hard real-time systems, the 
ORB must meet deterministic QoS requirements to ensure 
proper overall system functioning. These requirements moti- 
vate many of the optimizations and design strategies presented 
in this paper. However, the architectural design and perfor- 
mance optimizations in TAO’s ORB endsystem are equally 
applicable to many other types of real-time applications, such 
as telecommunications, network management, and distributed 
multimedia systems, which have statistical QoS requirements. 

The C++ source code for TAO and ACE is freely available 
at www.cs.wustl.edu/~schmidt/TAO.html. This 
release also contains the ORB benchmarking test suites de- 
scribed in this paper. 




5th USENIX Conference on Object-Oriented Technologies and Systems (COOTS '99) 


Acknowledgements 


We would like to thank our COOTS shepherd, Steve Vinoski,
whose comments helped improve this paper. In addition,
we would like to thank the COOTS Program Committee
and anonymous reviewers for their constructive suggestions for
improving the paper.


References 


[1] Object Management Group, The Common Object Request Broker:
Architecture and Specification, 2.2 ed., Feb. 1998.

[2] S. Vinoski and M. Henning, Advanced CORBA Programming
With C++. Addison-Wesley Longman, 1999.

[3] Object Management Group, Realtime CORBA 1.0 Joint Submission,
OMG Document orbos/98-12-05 ed., December 1998.

[4] A. Gokhale and D. C. Schmidt, “Measuring the Performance
of Communication Middleware on High-Speed Networks,” in
Proceedings of SIGCOMM ’96, (Stanford, CA), pp. 306-317,
ACM, August 1996.

[5] I. Pyarali, T. H. Harrison, and D. C. Schmidt, “Design
and Performance of an Object-Oriented Framework for
High-Performance Electronic Medical Imaging,” USENIX Computing
Systems, vol. 9, November/December 1996.

[6] A. Gokhale and D. C. Schmidt, “Measuring and Optimizing
CORBA Latency and Scalability Over High-speed Networks,”
Transactions on Computing, vol. 47, no. 4, 1998.

[7] D. C. Schmidt, S. Mungee, S. Flores-Gaitan, and A. Gokhale,
“Alleviating Priority Inversion and Non-determinism in
Real-time CORBA ORB Core Architectures,” in Proceedings of the
4th IEEE Real-Time Technology and Applications Symposium,
(Denver, CO), IEEE, June 1998.

[8] S. Vinoski, “CORBA: Integrating Diverse Applications Within
Distributed Heterogeneous Environments,” IEEE Communications
Magazine, vol. 14, February 1997.

[9] D. C. Schmidt, D. L. Levine, and S. Mungee, “The Design and
Performance of Real-Time Object Request Brokers,” Computer
Communications, vol. 21, pp. 294-324, Apr. 1998.

[10] C. D. Gill, D. L. Levine, and D. C. Schmidt, “Evaluating
Strategies for Real-Time CORBA Dynamic Scheduling,” The
International Journal of Time-Critical Computing Systems, special
issue on Real-Time Middleware, 1999, to appear.

[11] T. H. Harrison, D. L. Levine, and D. C. Schmidt, “The Design
and Performance of a Real-time CORBA Event Service,”
in Proceedings of OOPSLA ’97, (Atlanta, GA), ACM, October
1997.

[12] F. Kuhns, D. C. Schmidt, and D. L. Levine, “The Design
and Performance of RIO - A Real-time I/O Subsystem for
ORB Endsystems,” in Proceedings of the 5th IEEE Real-Time
Technology and Applications Symposium, (Vancouver, British
Columbia, Canada), IEEE, June 1999.

[13] D. C. Schmidt and C. Cleeland, “Applying Patterns to Develop
Extensible ORB Middleware,” IEEE Communications Magazine,
April 1999.

USENIX Association 


[14] Alistair Cockburn, “Prioritizing Forces in Software Design,”
in Pattern Languages of Program Design (J. O. Coplien,
J. Vlissides, and N. Kerth, eds.), pp. 319-333, Reading, MA:
Addison-Wesley, 1996.

[15] A. Gokhale and D. C. Schmidt, “Optimizing a CORBA IIOP
Protocol Engine for Minimal Footprint Multimedia Systems,”
Journal on Selected Areas in Communications special issue on
Service Enabling Platforms for Networked Multimedia Systems,
1999.

[16] G. Varghese, “Algorithmic Techniques for Efficient Protocol
Implementations,” in SIGCOMM ’96 Tutorial, (Stanford, CA),
ACM, August 1996.

[17] I. Pyarali and D. C. Schmidt, “An Overview of the CORBA
Portable Object Adapter,” ACM StandardView, vol. 6, Mar.
1998.

[18] D. C. Schmidt and S. Vinoski, “C++ Servant Managers for the
Portable Object Adapter,” C++ Report, vol. 10, Sept. 1998.

[19] D. C. Schmidt and S. Vinoski, “Using the Portable Object
Adapter for Transient and Persistent CORBA Objects,” C++
Report, vol. 10, April 1998.

[20] D. Schmidt and S. Vinoski, “Distributed Callbacks and
Decoupled Communication in CORBA,” C++ Report, vol. 8, October
1996.

[21] D. L. Tennenhouse, “Layered Multiplexing Considered Harmful,”
in Proceedings of the 1st International Workshop on
High-Speed Networks, May 1989.

[22] Z. D. Dittia, J. Jerome R. Cox, and G. M. Parulkar, “Design of
the APIC: A High Performance ATM Host-Network Interface
Chip,” in IEEE INFOCOM ’95, (Boston, USA), pp. 179-187,
IEEE Computer Society Press, April 1995.

[23] D. C. Schmidt, “GPERF: A Perfect Hash Function Generator,”
in Proceedings of the 2nd C++ Conference, (San Francisco,
California), pp. 87-102, USENIX, April 1990.

[24] P. Jain and D. C. Schmidt, “Service Configurator: A Pattern
for Dynamic Configuration of Services,” in Proceedings of the
3rd Conference on Object-Oriented Technologies and Systems,
USENIX, June 1997.

[25] E. Gamma, R. Helm, R. Johnson, and J. Vlissides, Design
Patterns: Elements of Reusable Object-Oriented Software.
Reading, MA: Addison-Wesley, 1995.

[26] A. Gokhale, D. C. Schmidt, C. O’Ryan, and A. Arulanthu, “The
Design and Performance of a CORBA IDL Compiler Optimized
for Embedded Systems,” in Submitted to the LCTES workshop
at PLDI ’99, (Atlanta, GA), IEEE, May 1999.

[27] A. Gokhale and D. C. Schmidt, “Evaluating the Performance
of Demultiplexing Strategies for Real-time CORBA,” in
Proceedings of GLOBECOM ’97, (Phoenix, AZ), IEEE, November
1997.

[28] A. Gokhale and D. C. Schmidt, “The Performance of the
CORBA Dynamic Invocation Interface and Dynamic Skeleton
Interface over High-Speed ATM Networks,” in Proceedings
of GLOBECOM ’96, (London, England), pp. 50-56, IEEE,
November 1996.

[29] Object Management Group, Minimum CORBA - Joint Revised
Submission, OMG Document orbos/98-08-04 ed., August 1998.


[30] D. C. Schmidt, S. Mungee, S. Flores-Gaitan, and A. Gokhale,
“Software Architectures for Reducing Priority Inversion and
Non-determinism in Real-time Object Request Brokers,” Journal
of Real-time Systems, To appear 1999.

[31] D. C. Schmidt and S. Vinoski, “Developing C++ Servant
Classes Using the Portable Object Adapter,” C++ Report,
vol. 10, June 1998.

[32] D. L. Levine, C. D. Gill, and D. C. Schmidt, “Dynamic
Scheduling Strategies for Avionics Mission Computing,” in
Proceedings of the 17th IEEE/AIAA Digital Avionics Systems
Conference (DASC), Nov. 1998.

[33] Z. D. Dittia, G. M. Parulkar, and J. Jerome R. Cox, “The APIC
Approach to High Performance Network Interface Design:
Protected DMA and Other Techniques,” in Proceedings of
INFOCOM ’97, (Kobe, Japan), IEEE, April 1997.

[34] F. Kuhns, C. O’Ryan, D. C. Schmidt, and J. Parsons, “The
Design and Performance of a Pluggable Protocols Framework for
Object Request Broker Middleware,” in Submitted to the IFIP
6th International Workshop on Protocols For High-Speed
Networks (PfHSN ’99), (Salem, MA), IFIP, August 1999.

[35] D. C. Feldmeier, “Multiplexing Issues in Communications
System Design,” in Proceedings of the Symposium on Communications
Architectures and Protocols (SIGCOMM), (Philadelphia,
PA), pp. 209-219, ACM, Sept. 1990.

[36] D. R. Engler and M. F. Kaashoek, “DPF: Fast, Flexible Message
Demultiplexing using Dynamic Code Generation,” in Proceedings
of ACM SIGCOMM ’96 Conference in Computer Communication
Review, (Stanford University, California, USA),
pp. 53-59, ACM Press, August 1996.

[37] J. C. Mogul, R. F. Rashid, and M. J. Accetta, “The Packet Filter:
an Efficient Mechanism for User-level Network Code,” in
Proceedings of the 11th Symposium on Operating System
Principles (SOSP), November 1987.

[38] S. McCanne and V. Jacobson, “The BSD Packet Filter: A New
Architecture for User-level Packet Capture,” in Proceedings of
the Winter USENIX Conference, (San Diego, CA), pp. 259-270,
Jan. 1993.

[39] M. Yuhara, B. Bershad, C. Maeda, and E. Moss, “Efficient
Packet Demultiplexing for Multiple Endpoints and Large
Messages,” in Proceedings of the Winter Usenix Conference,
January 1994.

[40] M. L. Bailey, B. Gopal, P. Sarkar, M. A. Pagels, and L. L.
Peterson, “Pathfinder: A pattern-based packet classifier,” in
Proceedings of the 1st Symposium on Operating System Design and
Implementation, USENIX Association, November 1994.

[41] M. Jayaram and R. Cytron, “Efficient Demultiplexing of
Network Packets by Automatic Parsing,” in Proceedings of the
Workshop on Compiler Support for System Software (WCSSS
96), (University of Arizona, Tucson, AZ), February 1996.

[42] Microsoft Corporation, Distributed Component Object Model
Protocol (DCOM), 1.0 ed., Jan. 1998.

[43] Sun Microsystems, Inc, Java Remote Method Invocation
Specification (RMI), Oct. 1998.


159 


USENIX Association 


The Application of Object-Oriented Design Techniques to the 
Evolution of the Architecture of a Large Legacy Software System 


Jeff Mason (jeff.mason@xilinx.com) and Emil S. Ochotta (emil.ochotta@xilinx.com)
Xilinx Inc. 
2100 Logic Drive 
San Jose, CA 95124 


ABSTRACT 


Object Oriented Analysis and Design (OOAD) is 
increasingly popular as a set of techniques that can 
be used to initially analyze and design software. 
Unfortunately, OOAD is a relatively new concept 
and many large legacy systems predate it. This paper 
presents the approach one company followed in 
applying OOAD techniques to an existing 2.5 million 
line code base. We present an iterative process that 
provides an avenue for the software to evolve while 
balancing the needs of business and software engi- 
neering. Our case study reveals the many pitfalls 
that can derail a software re-engineering effort, but 
also shows promising initial results from continued 
perseverance in this effort. 


1. Introduction 


Object Oriented Analysis and Design (OOAD) 
techniques promise many benefits to software 
developers and software companies - software reuse and 
resilience to change through component libraries and 
patterns[1][2], lucid code structure that more clearly 
reflects the problem domain[3], and reduced risk by 
introducing a formal design process where often none 
existed previously[4] - to name only a few. To reap 
these rewards, most OOAD techniques assume software 
developers apply the techniques at the beginning of the 
software lifecycle, i.e., the beginning of the design 
process, and continue to use them as the software 
matures. Unfortunately, OOAD is a relatively new 
concept and many large legacy systems predate it. 
Moreover, because the pressures of commercial 
competition focus directly on adding features, fixing 
bugs, and releasing the product on-time, software 
developers often (misguidedly) skimp on the things that 
should be done for long term benefit in favor of the 
things that absolutely must be done to complete the 
product. Since not all developers are educated as to the
benefits of OOAD, it is often one of the things
overlooked in the headlong rush to a software release.
The long-term price of this behavior is a large body of
difficult-to-maintain software, proving the well-known
adage that the overriding cost of software is not its 
initial development but rather its maintenance. In this 
environment of legacy software and corporate pressure, 
reaping the benefits of OOAD seems a very elusive 
goal. 


This paper describes the process one company 
undertook to re-architect their large legacy software 
system and begin reaping the benefits of OOAD 
techniques despite the constraints of continuing feature 
improvements and a strict release schedule. This six- 
step process is as follows: 


1. Analysis: evaluating the current state of the legacy 
software; 


2. Goal Selection: determining a set of goals to guide 
changes to the software and allow evaluation of the 
results; 


3. Key Concept Selection: refining the goals into a set 
of key concepts based on business requirements, 
software engineering principles and object oriented 
analysis and design principles; 


4. Planning: determining how best to apply the key 
concepts to the legacy software to allow it to evolve 
towards a system that satisfies those concepts; 


5. Implementation: making it happen; and 


6. Measurement: evaluating the effectiveness of the 
changes against the original goals. 


There is a substantial body of research that focuses on 
the technical aspects of software evolution[5][6] and 
reengineering[7][8], and many of the technical ideas 
discussed in this paper have been described elsewhere in 
one form or another. The contribution of this paper is 
the description of the process we undertook and how we 
selected and satisfied key concepts that balanced the 
demands of business, the requirements of software 
engineering, and the OOAD principles we wanted to 
pursue. In the end, these key concepts included: 


* Autonomy: encapsulating and insulating functionally
related software into subsystems to minimize
interactions, to reduce compile times, and to support
testing, allowing these subsystems to evolve independently
and asynchronously;


* Sharing: solving problems in as few places and as
few times as possible to maximize code reuse, minimize
code size, and promote standardization;


* Comprehensibility: promoting design, documentation
and coding standards that - for the general client
- make shared code and interfaces easier to understand,
more convenient to use, and easier to maintain;


* Modularity: allowing functional product components
to be released to end users independently and
asynchronously;


* Co-development: promoting the ability to explore,
evaluate, and develop new features without affecting
other on-going development;


* Innovation: promoting runtime, memory, and quality
of results performance through optimization and
innovation;


* Testing: enabling efficient automated testing by creating
a levelizable system[9] (i.e., a system where
the testing and compile-time dependencies between
software modules form a directed acyclic graph);


* Release: supporting a release model with fixed
release dates planned long in advance.
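
The Testing concept in the list above depends on the dependency graph being acyclic. As a rough illustration only (it is not the tooling used at Xilinx), the following C++ sketch assigns a level to each package and reports failure when a cycle makes the system non-levelizable. The names DependencyGraph, computeLevels, and the package names in the usage note are hypothetical.

```cpp
#include <algorithm>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical model: package name -> names of the packages it depends on.
typedef std::map<std::string, std::vector<std::string> > DependencyGraph;

// Depth-first helper. Returns the level of `pkg`, or -1 if a cycle is
// reachable from it. A package with no dependencies has level 0; otherwise
// its level is one more than the highest level among its dependencies.
static int visitPackage(const std::string& pkg, const DependencyGraph& graph,
                        std::map<std::string, int>& levels,
                        std::set<std::string>& onPath) {
    std::map<std::string, int>::const_iterator known = levels.find(pkg);
    if (known != levels.end()) return known->second;
    if (!onPath.insert(pkg).second) return -1;  // already on this path: cycle

    int level = 0;
    DependencyGraph::const_iterator entry = graph.find(pkg);
    if (entry != graph.end()) {
        for (std::size_t i = 0; i < entry->second.size(); ++i) {
            int depLevel = visitPackage(entry->second[i], graph, levels, onPath);
            if (depLevel < 0) return -1;
            level = std::max(level, depLevel + 1);
        }
    }
    onPath.erase(pkg);
    levels[pkg] = level;
    return level;
}

// Fills `levels` and returns true if the graph is levelizable (acyclic);
// returns false as soon as a cyclic dependency is found.
bool computeLevels(const DependencyGraph& graph,
                   std::map<std::string, int>& levels) {
    std::set<std::string> onPath;
    for (DependencyGraph::const_iterator it = graph.begin();
         it != graph.end(); ++it) {
        if (visitPackage(it->first, graph, levels, onPath) < 0) return false;
    }
    return true;
}
```

In a levelizable system, packages can be tested bottom-up: level 0 first, then level 1 against already-tested level 0 packages, and so on. A cycle removes that ordering, which is why the sketch treats it as a failure.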


To implement these concepts, we developed a system 
architecture vision that outlined the changes to the 
software architecture that were designed to put these 
key concepts into practice. We then put forward an 
evolutionary plan to implement the vision. What quickly 
became apparent is that the inertia of the software was 
too large to allow all our changes to be implemented at 
once, while still releasing working software on an 
aggressive fixed schedule. Consequently, we realized 
that the six-step process described above must be 
applied iteratively, over an extended period of years. 


Since the full implementation of this vision is an 
ongoing task whose costs and benefits may not be fully 
evaluated for many years, this paper describes the initial 
iteration through that six-step process. In this first 
iteration the implementation had to be scaled back to fit 
within a single release cycle of less than a year and 
focussed primarily on the key concepts of autonomy, 
sharing, testing, and comprehensibility. In these areas, 
we have seen some dramatic improvements, particularly 
where quantitative measurement is straightforward, 
such as compile-time coupling. 


The remainder of this paper is organized as follows. In 
the next section we detail the first step in the six-step 
process we followed, outlining the state of the software 
system and the corporate situation that forms the 
backdrop for our work. In Section 3 and Section 4, we 
describe the next two steps in the process, the 
conceptual steps of setting the correct goals we are 
working toward and selecting key concepts that reflect 
those goals. In Section 5, we present the evolutionary 
plan we created to work towards realizing those key 
concepts in our software. In Section 6, we discuss the 
implementation of this evolutionary plan, and in 
Section 7, we evaluate this implementation against the 
key concepts and our initial goals. Finally, in Section 8, 
we present our conclusions. 


2. Background and Analysis 


In this section, we describe the first of the six steps in 
the process we followed to re-architect our legacy
software system. We first present the environment in
which our work was performed, including a brief
description of Xilinx, the company where the work was 
performed, and the purpose of the software. We then 
discuss the state of disrepair we found when we first 
began to look at the software system itself and the costs 
associated with that disrepair. These costs were the 
initial motivation that drove our re-architecture. 


2.1 Xilinx Inc. 

This work was performed at Xilinx[10]. Xilinx was 
created as a hardware company, producing FPGAs, 
which are members of the family of integrated circuits 
(ICs) called programmable logic. 


Understanding how an FPGA is used provides some 
useful insight into the complexity of the FPGA design 
software we discuss in this paper. An example FPGA 
application is emulating another IC or computer chip. In 
this application, the design to be emulated is loaded into 
the FPGA and the FPGA inserted into the system of 
which the chip being emulated is a part. This technique 
allows the design of the new chip and the system of 
which it is part to be tested and debugged before the 
new chip is actually built. Similar to a compiler, FPGA 
design software automatically translates the high-level 
description of the chip to be emulated into millions of 
programming bits that configure the FPGA to perform 
the emulation. Part of this translation task involves 
selecting a location from among the thousands available 
on the FPGA for each logical element. These locations 
must be selected to optimize chip performance or other 
user-specified constraints, creating an NP-complete[11] 
combinatorial optimization problem[12]. Moreover, in 
response to competition and customer demand, FPGAs 
are continually increasing in size and new hardware 
features are added to each new FPGA[13]. To keep pace 
with these newer, bigger FPGAs and still provide new 
software features, the FPGA design software is 
increasing in size and complexity at an even faster rate. 
Finally, because software provides the abstract model 
with which most FPGA customers interact, Xilinx has 
put increasing emphasis on software development in 
order to turn our software into a competitive advantage. 
The difficulty of the FPGA optimization problem, 
continually evolving FPGA hardware, and increasing 
customer reliance on fast reliable software conspire to 
make writing FPGA design software a challenging 
proposition. 


2.2 The State of Xilinx Software 

As one step toward improving its software, in early 
1995 Xilinx acquired a small software start-up company 
based in Colorado. At that time, Xilinx’ FPGA design 
software consisted of nearly 1.5 million lines of C code 
developed and maintained by approximately 70 staff 
members. Xilinx had released 30 software revisions to 
over 10,000 software customers. In contrast, the 30 
engineers of the close-knit start-up had written just over 
700,000 lines of highly interconnected C++ and had
released 6 software revisions to about 200 customers.
The start-up’s code was poorly documented, but a 
knowledgeable person was always at hand to deal with 
any issue or problem. Thus, change requests were 
informal conversations and system-wide changes could 
be implemented and compiled within a few hours. 


After the purchase, the corporate goal was to merge the 
two software systems, keeping the strengths of both. 
This was easier said than done. The C++ code from the 
start-up was selected as the software base for the future 
merged product, and features that had been added to the 
original Xilinx product based on customer requests were 
to be added as needed. Software developers in Colorado 
were now faced with a much larger development 
environment and had to work with developers in 
California who understood the features to be added but 
did not understand the software base. Software 
developers in California were now faced with giving up 
their old software, learning a new and undocumented 
software base and working with developers in Colorado 
who did not understand the new features to be added. 
Neither group was used to working across multiple 
development sites, so “lack of communication” was one 
of the most common complaints by both groups about 
their peers on the other side of the mountains. Software 
was not getting built on time and fingers were being 
pointed in all directions. It was a difficult time for all 
involved. 


Work toward the first merged release took substantially 
longer than anyone had dared to predict, and we missed 
several target release dates. Upper management began 
to apply greater pressure to the software team, justifying 
decisions to take “short-cuts” on the basis of short-term 
necessity. As is frequently the case, it is arguable 
whether these “short-cuts” reduced the time to first 
customer shipment, but they unquestionably came back 
to haunt us by adding to our maintenance burden over 
the next few releases. 


After the frenzied days and nights of making our first 
few merged releases a reality, we took stock of our new 
software. The start-up’s 700,000 lines of C++ had 
ballooned to approximately 2.5 million lines of C++ 
code in roughly 2200 source and header files. Our 
software shipped as 45 executables, 130 shared libraries 
(loaded at program start up), and 110 dynamically 
loaded libraries that customized the software for the 
different FPGAs in the Xilinx product line. Our single 
source software supported the Solaris, Windows, HP 
and RS6000 platforms. The source code was organized 
into approximately 400 subdirectories called packages, 
where each package produced either a library or an 
executable. After a brief inspection, we identified 
several major problem areas that we later categorized 
according to the key concepts to which they relate: 


* Comprehensibility. Just as it was in the start-up, the
code was mostly undocumented, but now it was
much more complex and growing so rapidly that it
was no longer possible to find any one person who
understood most of it.


* Autonomy (Encapsulation). The interfaces
between packages had evolved as necessary to meet 
tactical, local needs, without regard for strategic, 
system-level concerns. Consequently interfaces were 
extremely broad and ill defined. There was no clear 
division between the interface and the implementa- 
tion of most classes. Much of the code was really just 
old C code transformed into C++ objects. One of the 
major indicators of a lack of encapsulation was direct 
access of class data by another class. Many of our 
classes had been designed with public get/set func- 
tions for each of the class data members. Conse- 
quently, changes that should have been internal to a 
package had repercussions throughout the system. 


* Autonomy (Insulation). The compile-time dependencies
(due to included files) had never been
designed or analyzed. Often the vast majority of the 
compilation time for a module consisted of reading 
and processing included files. When first designing 
C++ classes, the tendency is to make the header file 
as convenient as possible for the implementation of 
that header. For example, the lowest level header file 
in the Xilinx software system directly or indirectly 
included almost 60 system header files, establishing 
a platform independent interface to the operating 
system. However, in such a large system, most of 
this functionality was not used by most of the clients 
that included it. This overhead is an unneeded bur- 
den to clients, who often compile complete defini- 
tions of many unused classes or classes that require 
only a forward reference. Engineers, recognizing this 
system-wide problem but unable to change it, were 
starting to make extremely large source files because 
the compilation times were faster than the aggregate 
compilation time of many smaller files. 


* Autonomy (Insulation). Another problem was the
rampant use of ‘inline’ functions. Inline functions are
expanded at compile time rather than run time. This
implies that a class that defines an inline function
cannot change the implementation of that inline function
without forcing all of its clients to recompile.


* Autonomy. The turnaround time to build and verify
our software had become one to two weeks. Most of 
that time was spent in tracking down integration 
problems and then rebuilding everything. Because of 
the interdependence of the software, a compilation 
problem in one package may actually be caused by 
interface problems in any one of a large number of 
packages. Tracking down and solving these integra- 
tion problems was made even more difficult because 
finding someone who understood the disparate parts 
of the system was no longer possible. Because of the 
difficulty of compiling several million lines of code 
on a single workstation, developers typically developed
and tested against builds that were several
weeks out of date, exacerbating the integration problems
for the next build.


* Sharing and Autonomy. There was no person or
group whose responsibility it was to review or co- 
ordinate code changes. Each engineer or group was 
free to implement or use what they needed to get 
their specific job completed. Sometimes system- 
wide integration builds failed because large changes 
were made to shared code to support a new feature, 
but the changes were not tested for all clients. Other 
times, when small changes to a large package were 
required, engineers would copy the entire package 
into their package to avoid having to work with the 
other package's owner. 
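
The insulation problem in the list above (headers that pull in far more than their clients need) is commonly reduced with forward references. The following is a minimal sketch, collapsed into one file for brevity. The classes Router and Netlist and the file names in the comments are hypothetical and are not taken from the Xilinx code base.

```cpp
// --- router.h (hypothetical) ---------------------------------------------
// A forward reference is enough here because this header mentions Netlist
// only by reference and by pointer. Clients that include router.h therefore
// do not pay for parsing netlist.h or anything netlist.h includes.
class Netlist;

class Router {
public:
    explicit Router(const Netlist& netlist);
    int wireCount() const;   // declared only; defined in router.cpp
private:
    const Netlist* netlist_;
};

// --- netlist.h (hypothetical) --------------------------------------------
class Netlist {
public:
    explicit Netlist(int wires) : wires_(wires) {}
    int wires() const { return wires_; }
private:
    int wires_;
};

// --- router.cpp (hypothetical) -------------------------------------------
// Only the implementation file needs the complete definition of Netlist,
// so in a real build this is the one place that would include netlist.h.
Router::Router(const Netlist& netlist) : netlist_(&netlist) {}

int Router::wireCount() const { return netlist_->wires(); }
```

Because wireCount() is not inline, its body can change without forcing clients of router.h to recompile, which also addresses the inline-function problem noted above.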


Problems like these were creating a software and 
corporate environment where developers no longer had 
the freedom or time to innovate. They had no freedom 
because every non-critical project was deemed high 
risk, since the complex package interdependencies 
could cause minor errors to have major repercussions 
throughout the system. They had no time because fixing 
each small problem required an inordinate amount of 
time to implement and verify. 


After this analysis, it was clear that something needed to 
change. Fortunately for Xilinx, senior management 
understood the issues and that the long-term viability of 
the software product was at stake. With their support, 
several members of the company were chartered with 
re-architecting the software to fix these problems. 


3. Goal Selection and The System Architecture Committee

The software management team recognized that Xilinx' 
software needed significant re-design at the architectural 
level, requiring co-operation from the entire software 
organization. They created the System Architecture 
Committee, a seven member team of engineers and 
managers that represented both development sites. The 
VP of software was a member of the committee, giving 
it the needed management clout. The authors were 
selected as members of this committee. 


Initially, it was thought that members of the committee 
would spend roughly 10% of their time looking at 
system architecture issues, but as the weeks passed and 
the extent of the problem became more clear, the work- 
load quickly grew beyond 10% of each member's time. 
To give the architectural work the attention it required, 
the authors became full time architects, and most 
members were required to put aside their other duties 
for short periods of time to complete work for the 
committee. 


In the first few weeks of meetings, little was 
accomplished and frustrations grew. Several members 
proposed changes that they felt would improve the 
existing software architecture, but the group could not 
reach consensus. Eventually, it became clear that the 
goals of the various members of the committee were
inconsistent, which led to disagreement over the
changes that were required, which in turn led to 
stalemate and inaction. Consequently, the committee 
had to agree on its goals before it could take any steps to 
improve the software architecture. Choosing the goals 
was the seed for the six-step process that we eventually 
followed to bring our architectural changes to fruition. 


The committee agreed upon six goals, several of which
conflicted, making it impossible to satisfy all the
goals simultaneously. Initially the group was
disheartened that we could not select a set of goals that 
could be satisfied completely, but over time it became 
clear that this tension between the goals reflected the 
reality of business and of the software design process. In 
both environments, there are no right answers and 
compromise is essential to success. Moreover, the 
ability to strike the correct balance between competing 
goals is what distinguishes successful businesses and 
software organizations, and this made the design 
process challenging and exciting. 


After much discussion and negotiation, the committee 
agreed upon the following six goals: 


* Provide superior end user productivity. Make internal
architectural improvements that eventually result
in customer visible improvements in our software. 
Xilinx customers are the first priority. 


* Distribute productivity effectively across development
groups. Address “geography problems,” where
developers in different groups do not communicate. 
The developers who were originally in the start-up 
felt they could not get their work done due to contin- 
ually having to educate the other developers. The 
other developers felt they could not get their work 
done because they were not trusted to modify the 
existing core of the software. In practice, geography 
problems can happen even when the groups are 
physically adjacent, and the software architecture 
can have a significant effect on inter-group commu- 
nication. These problems have a large negative 
impact on productivity and morale. 


* Improve the productivity of individual developers.
Create an environment where individual contributors 
can work more efficiently, without having to wait for 
other developers to complete their tasks. 


* Enable parallel development targeting multiple
release dates. Develop an environment that supports 
projects that require more time than a single release 
cycle. This goal is a direct result of the fixed release 
schedule required to support new FPGAs in a timely 
fashion. 


* Build in flexibility to handle a constantly changing
market. Anticipate the aspects of the software that
will most likely change: new kinds of FPGAs, new
software features, etc. Ensure that the software architecture
is not brittle when these kinds of changes are
required.


* Enable accurate and efficient measurement of the
quality of the system by designing for testability. 
Plan from the outset to incorporate a testing infra- 
structure that supports measurement of software 
quality that is both fast and accurate. 


These six goals formed the foundation for the rest of our 
software re-architecture work. They were driven 
primarily by business rather than OOAD or software 
engineering goals. When creating them, we also 
explicitly decided not to consider how we would 
accomplish these goals. They are merely what we 
wanted in an ideal world. Consequently, they form an 
ideal set of metrics with which we can evaluate the 
efficacy of our software architecture decisions. 


4. The Key Concepts 

Once the goals were in place, the next step was to 
determine how to achieve them. We soon realized that 
there was too large a semantic leap from the goals to 
actual architecture and code changes. What was needed 
was an intermediate step where we agreed on a set of 
principles from the worlds of OOAD and software 
engineering. These principles would reflect the above 
goals but more closely relate to the software and code 
architecture itself. This tighter relationship to the 
software would make it possible to create an 
implementation plan. 


These principles are the eight key concepts introduced 
in Section 1 that tie our architecture work together. The
first two (autonomy and sharing) are primarily OOAD 
techniques, and the last (release) is a business 
constraint. The others lie somewhere on the spectrum of 
OOAD techniques and plain old software engineering. 
As with the goals, these concepts are in tension: any 
plan will favor some concepts over the others. In this 
section we describe these key concepts and how they 
connect the six goals to changes that can be realized in a
software architecture. 


4.1 Autonomy 

* Autonomy: encapsulating and insulating functionally
related software into subsystems to minimize
interactions, to reduce compile times, and to support 
testing, allowing these subsystems to evolve inde- 
pendently and asynchronously. 


Autonomy follows directly from both of the 
productivity goals: distribute productivity effectively 
across development groups and improve the
productivity of individual developers. Engineers are 
most efficient when they are free from dependencies 
and allowed to work alone or as members of a small, 
tightly-knit group. 


Software dependencies can be exacerbated by poor 
software architecture and by failing to adhere to OOAD 
basics. For example, failing to encapsulate a data 
structure means that clients of a package use that data 
structure directly. When the data structure changes, the 
client must also change. This is an example of poor
autonomy, since both the client and the supplier may be
forced to wait for each other. The supplier may not be 
allowed to change the data structure until the client is 
ready, or the client may be unable to compile code that 
requires the new data structure until the supplier 
completes its implementation. 
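
To make the dependency concrete, here is a small sketch of the encapsulated alternative, in which the client uses only a procedural interface. The class PinTable, the function countAssigned, and the choice of std::map as storage are assumptions made for illustration.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical supplier class. Clients see only the member functions, so the
// storage (a std::map here) could later become a hash table or a sorted
// vector without any change to client source code.
class PinTable {
public:
    void assign(const std::string& pin, int location) {
        locations_[pin] = location;
    }
    bool isAssigned(const std::string& pin) const {
        return locations_.count(pin) != 0;
    }
    // Returns the location of `pin`, or -1 if it has not been assigned.
    int locationOf(const std::string& pin) const {
        std::map<std::string, int>::const_iterator it = locations_.find(pin);
        return it == locations_.end() ? -1 : it->second;
    }
private:
    std::map<std::string, int> locations_;  // implementation detail
};

// Hypothetical client. It never touches the underlying container, so the
// supplier and the client can change on independent schedules.
int countAssigned(const PinTable& table, const std::vector<std::string>& pins) {
    int count = 0;
    for (std::size_t i = 0; i < pins.size(); ++i) {
        if (table.isAssigned(pins[i])) ++count;
    }
    return count;
}
```

Had locations_ been public and read directly by countAssigned, replacing the map would have required editing the client as well, which is the waiting-on-each-other situation described above.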


We recognize two facets of autonomy that are closely 
linked to OOAD principles: insulation and 
encapsulation. Insulation can be defined as the process 
of avoiding or removing unnecessary compile-time 
coupling[9]. In practice, insulation can be implemented 
by creating an opaque interface. For example, Lakos 
defines a fully insulated class as one that is not derived 
from another class, contains no inline functions or 
default arguments, and contains only a single pointer to 
an implementation class that is declared with a forward 
reference. The details of the implementation class are 
completely hidden from any clients that include the 
fully insulating class. The effect of full insulation is to 
create header files that are completely independent of 
each other, dramatically reducing the compile-time 
overhead of header file inclusion. 
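
A fully insulated class in the sense just described might look like the following sketch. It is shown as a single file with comments marking where the header would end. The names Placer and PlacerImpl are hypothetical, and copying is simply disabled to keep the example short.

```cpp
#include <string>

// --- placer.h (hypothetical header seen by clients) ----------------------
class PlacerImpl;   // forward reference; the definition stays out of the header

class Placer {      // no base class, no inline functions, no default arguments
public:
    Placer();
    ~Placer();
    void place(const std::string& element, int location);
    int placedCount() const;
private:
    Placer(const Placer&);             // declared but not defined: no copying
    Placer& operator=(const Placer&);
    PlacerImpl* impl_;                 // the single data member
};

// --- placer.cpp (hypothetical implementation file) -----------------------
#include <map>   // needed only by the implementation, never by clients

class PlacerImpl {
public:
    std::map<std::string, int> locations;
};

Placer::Placer() : impl_(new PlacerImpl) {}

Placer::~Placer() { delete impl_; }

void Placer::place(const std::string& element, int location) {
    impl_->locations[element] = location;
}

int Placer::placedCount() const {
    return static_cast<int>(impl_->locations.size());
}
```

The cost of this design is one extra heap allocation per object and one pointer indirection per call, which is the kind of trade-off against performance that the key concepts keep in tension.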


Another facet of autonomy is encapsulation, which 
should be familiar to practitioners of OOAD. 
Encapsulation can be defined as the concept of hiding 
implementation details behind a procedural interface[9]. 
Encapsulation and insulation are clearly related, but a 
fully insulated class need not be encapsulated. For 
example, a fully insulated class can still expose its 
implementation by providing public access functions to 
all its private data. However, in some respects 
encapsulation can be a less drastic technique than 
insulation because encapsulation allows the use of other 
features of C++, such as inheritance and inline
functions. In this paper we refer to insulation when 
discussing the compile-time independence of modules 
from one another and refer to encapsulation when 
discussing the logical independence of a client class 
from the implementation decisions of its suppliers. 


4.2 Sharing 


* Sharing: solving problems in as few places and as
few times as possible to maximize code reuse, mini- 
mize code size, and promote standardization. 


Sharing falls into the general category of software reuse, 
a subject frequently discussed in the literature (see, for
example, [15]). Reuse or sharing is also connected to the
goal of developer productivity because in principle it 
allows a piece of code to be written once and reused in 
several places. In practice, sharing is difficult to achieve 
because the clients of the shared code must agree on 
what exactly the code does. If the code is too 
specialized, it is unlikely to be useful to more than one 
client. On the other hand, if the code is too general, it 
will be too slow or so simple that reusing it
accomplishes little.


For the Xilinx software system, two kinds of sharing or
reuse are of particular interest. Because Xilinx supports 
a number of different hardware devices that are 
fundamentally related, the Xilinx software is an 
excellent candidate for sharing via domain 
engineering[16]. In domain engineering, tasks that are 
needed throughout the domain are abstracted and 
written once. In this case, common tasks needed to 
support all devices can be abstracted and written as 
configurable or data driven algorithms. Fortunately, this 
characteristic had been recognized by the original 
designers of the core software created in the start-up and 
the software already made significant use of this kind of 
sharing (although the term domain engineering had yet 
to be coined). 
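
As a rough illustration of a data driven algorithm of this kind, the sketch below writes a capacity check once and lets a per-device description supply the differences. The structure DeviceDescription, its fields, and the numbers in the usage note are invented for this example and do not describe any real Xilinx device.

```cpp
#include <string>

// Hypothetical per-device data. Supporting a new device means supplying a
// new description, not writing a new algorithm.
struct DeviceDescription {
    std::string name;
    int rows;           // rows of logic tiles
    int columns;        // columns of logic tiles
    int cellsPerTile;   // logic cells in each tile
};

// Written once for the whole device domain.
int logicCapacity(const DeviceDescription& device) {
    return device.rows * device.columns * device.cellsPerTile;
}

// True if a design that needs `requiredCells` logic cells fits on the device.
bool fits(const DeviceDescription& device, int requiredCells) {
    return requiredCells <= logicCapacity(device);
}
```

For instance, a description with 10 rows, 10 columns, and 2 cells per tile yields a capacity of 200 cells, so a 200-cell design fits and a 201-cell design does not.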


The second kind of sharing was not as well supported in 
the Xilinx software, and that is the more conventional 
sharing of small generic algorithms. In sharing of this 
kind, tasks that are not domain specific, but may be 
general mathematical functions, data structures, or other 
algorithms are collected into a reusable library. To 
succeed at this sharing, this library has to be carefully 
designed explicitly so that it can be reused. The 
designers have to pay particular attention to making the 
functionality general, efficient, and well documented. 
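
A small generic algorithm of the kind meant here could be as simple as the following sketch. The function median is a hypothetical library entry chosen for illustration. It is general (any copyable type with operator<), reasonably efficient (linear time on average through std::nth_element), and documented at the point of declaration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the median of a non-empty sequence. For an even number of
// elements, the upper of the two middle values is returned.
// The argument is taken by value so the caller's data is left untouched.
template <typename T>
T median(std::vector<T> values) {
    assert(!values.empty());
    const std::size_t middle = values.size() / 2;
    // Partially sorts so that values[middle] holds the element that would
    // be there if the whole sequence were sorted.
    std::nth_element(values.begin(), values.begin() + middle, values.end());
    return values[middle];
}
```

The documented choice for even-sized input is an example of the agreement problem noted earlier: clients must agree on exactly what the shared code does before they can rely on it.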


4.3 Comprehensibility 


* Comprehensibility: promoting design, documentation
and coding standards that - for the general client
- make shared code and interfaces easier to under- 
stand, more convenient to use, and easier to main- 
tain. 


Comprehensibility as defined by this key concept is not 
intended to increase the amount of communication 
between developers, but to reduce the need for it. This 
key concept again relates back to the productivity goals. 
The idea is to create a system and an environment that 
inherently reduces the need for additional documents to 
describe the architecture of the system itself. One of the 
main benefits of such a system is the reduced need for 
maintenance that can occur when a change to an 
interface must be made both in code and in one or more 
separate documents. 


4.4 Modularity 


* Modularity: allowing functional product components
to be released to end users independently and
asynchronously. 


Modularity is related primarily to the goal of superior
end user productivity, and it was already a strength
of the Xilinx software. As an example of
this key concept, software support for a single Xilinx 
device could be shipped as part of the overall software 
system or as an individual software plug-in. This 
modularity made it possible to create and support new 
hardware products without shipping a complete new 
software system. 

4.5 Co-development

* Co-development: promoting the ability to explore,
evaluate, and develop new features without affecting
other on-going development.


The key concept of co-development has two aspects that 
relate to what is being developed concurrently. Both 
relate to the goal of flexibility in a constantly changing 
market. In the case of support for new hardware devices, 
co-development means that new hardware can be 
supported with a minimal impact on software. This is 
essential in a competitive marketplace where the most 
successful company is the one that can innovate and 
respond to change the most quickly. Similarly, the other 
aspect of co-development is support for features and 
changes that are not driven by hardware, but must be 
developed somewhat independently from the main body 
of software because they extend beyond a single release 
cycle. 


4.6 Innovation 


* Innovation: promoting runtime, memory, and quality
of result performance through optimization and
innovation.


The innovation concept follows directly from the goal 
of providing superior end user productivity, which is 
fundamentally tied to software performance. Superior 
performance can be achieved using two methods that 
are in tension. The first method is optimization. This can 
be thought of as tuning existing software to improve its 
runtime, memory, or quality of result performance. 
Tuning software can sometimes compete with OOAD 
design principles such as encapsulation. For example, 
exploiting the underlying implementation of a data 
structure can sometimes result in significant
improvements in performance, but at a clear cost in 
encapsulation. 


The second method to improve performance is 
algorithmic change. For example, a developer may be 
able to squeeze a few percentage points of improvement 
out of a bubble sort algorithm by changing array 
operations to pointers and making function calls in-line. 
However, changing to a quick sort will yield 
significantly greater runtime improvements for large 
datasets because quicksort has better algorithmic 
complexity. 
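

The contrast between the two methods can be sketched in C++. The
function names below are illustrative only and are not taken from
the Xilinx software; std::sort stands in for quick sort here
(library implementations typically use a quicksort variant).

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Method 1 (optimization): a bubble sort tuned with raw pointers.
// However tight the loop becomes, it still needs O(n^2) comparisons.
inline void tuned_bubble_sort(std::vector<int>& v) {
    if (v.size() < 2) return;
    int* first = v.data();
    int* last = first + v.size();
    for (int* end = last; end - first > 1; --end) {
        bool swapped = false;
        for (int* p = first; p + 1 < end; ++p) {
            if (*(p + 1) < *p) {
                std::swap(*p, *(p + 1));
                swapped = true;
            }
        }
        if (!swapped) break;  // already sorted, stop early
    }
}

// Method 2 (algorithmic change): replace the algorithm with an
// O(n log n) sort. Clients that relied only on "the data ends up
// sorted" are unaffected by the swap.
inline void algorithmic_sort(std::vector<int>& v) {
    std::sort(v.begin(), v.end());
}
```

Both functions produce the same result; only the second keeps its
advantage as the dataset grows.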


The two methods of improving performance are in 
tension because detailed optimizations that increase 
coupling of client algorithms to supplier algorithms also 
make it extremely difficult to innovate by changing 
either the client or the supplier algorithm. In most cases, 
algorithmic innovation yields greater improvements 
than optimization, so the focus of this key concept is on 
enabling innovation. 


4.7 Testing 


* Testing: enabling efficient automated testing by creating
a levelizable system[9] (i.e., a system where
the testing and compile-time dependencies between
software modules form a directed acyclic graph).


The testing concept corresponds directly to the 
testability goal. In this case, the concept has a technical 
definition that can be concretely evaluated. By building 
a graph from the compile-time dependency structure, 
the system can be evaluated to see if it is levelizable. If 
there are any loops in the dependency graph, the system 
is not levelizable and is more difficult to test. This is 
because all modules involved in a loop must be tested 
together as a single unit. In the worst case, all modules 
will be involved in a loop and the entire system must be 
treated as a monolithic black box for testing. Since the 
difficulty of testing a module grows exponentially with 
the size of the module, levelizability is a
desirable property. In a large system such as the Xilinx
software system, it is easy to accidentally create a 
dependency that creates a loop in the compile-time 
dependency graph. The size of the system also makes 
such loops especially expensive in testing time. 
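

The levelizability check described above amounts to cycle detection
in the compile-time dependency graph. The following is a minimal
sketch, assuming the graph is available as a map from each module
name to the modules it depends on; the DepGraph type and the
function names are hypothetical and do not describe the tooling
actually used at Xilinx.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Module name -> modules it depends on at compile time (assumed model).
using DepGraph = std::map<std::string, std::vector<std::string>>;

namespace detail {
// Depth-first search; returns true if a loop is reachable from `node`.
inline bool has_cycle_from(const DepGraph& g, const std::string& node,
                           std::set<std::string>& in_progress,
                           std::set<std::string>& done) {
    if (done.count(node)) return false;                 // already cleared
    if (!in_progress.insert(node).second) return true;  // back edge: loop
    auto it = g.find(node);
    if (it != g.end()) {
        for (const std::string& dep : it->second) {
            if (has_cycle_from(g, dep, in_progress, done)) return true;
        }
    }
    in_progress.erase(node);
    done.insert(node);
    return false;
}
}  // namespace detail

// A system is levelizable when its dependency graph has no loops.
inline bool is_levelizable(const DepGraph& g) {
    std::set<std::string> in_progress, done;
    for (const auto& entry : g) {
        if (detail::has_cycle_from(g, entry.first, in_progress, done)) {
            return false;
        }
    }
    return true;
}
```

Running such a check as part of a regular build is one way to catch
an accidental loop soon after it is introduced.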


4.8 Release 
* Release: supporting a release model with fixed 
release dates planned long in advance. 


The release concept is closely tied to the goal to enable 
parallel development targeting multiple release dates. 
An additional aspect of the release goal is to force the 
development to happen gradually in an evolutionary 
fashion. By requiring customer releases on a fixed 
schedule, the development is forced into an evolutionary 
path, which reduces schedule risk. 


In summary, the creation of these goals and key 
concepts was a long and arduous process. However, 
because the key concepts provided techniques to realize 
the goals, subsequent work went significantly faster. 
Each new idea could be readily compared with the goals 
and concepts we had already agreed to implement, 
helping to keep the re-architecture process on track. 


5. Planning 

Armed with the newly created sets of goals, the key 
concepts, and a common mindset, the system 
architecture committee began to look at the software 
and come up with concrete plans for what should be 
changed. Here again, we followed a process that is clear 
in retrospect but at the time seemed full of bumps and 
blind alleys. We began by evaluating the current 
architecture against the goals and key concepts. We then 
chose the key concepts to be given first priority in the 
redesign effort. Based on the highest priority key 
concepts we proposed several different architectural 
solutions intended to address these concepts, then 
collected the best features of these proposals into a 
coherent document called the system architecture 
vision. With the vision as an endpoint, we created a plan 
to evolve from the starting point of our existing software 
architecture. Finally, we imposed the constraints of 
having to support new FPGAs, add new software
features, fix bugs, and work with limited resources and a
fixed release date. Considering these constraints, we
then extracted a detailed short-term plan that would get
us through the current release cycle. In this section we
describe each of these planning phases in greater detail.


Figure 1. The Personality Module (PM) contained device-specific
code that plugged in to the base code, but control flow was
determined by the base.


5.1 Prioritizing the Key Concepts 


Before we could come up with an implementation plan,
we first needed to prioritize among the key
concepts and decide which of them needed the most 
attention. To do this, we performed a careful evaluation 
of the existing software architecture against the goals 
and key concepts we had labored over for so long. This 
task allowed us to see which of the key concepts was 
least supported in the current software, and then to 
decide how we should focus our redesign efforts. 


Based on this analysis, autonomy emerged as the most 
important of the key concepts to guide changes to the 
system. Secondary emphasis was given to sharing, 
testing, and comprehensibility. This ordering did not 
discount the importance of the other concepts - indeed 
the final solution would need to balance among all eight 
- but it recognized that the existing architecture already 
had certain strengths. The existing source code 
architecture was composed of two levels of hierarchy. 
The first level, called the Personality Module (PM),
reflected the hardware device supported by that part of 
the software. Each PM contained packages, grouping 
the software within a PM by logical function. The 
remainder of the code, shared by all PMs, was called the 
“base”. This organization inherently supported sharing 
and re-use of base code by all other PMs. Moreover, 
each package from a PM created a Dynamically Loaded 
Library (DLL) that was loaded on demand, once the 
base code determined the device and required functions. 
As shown in Fig. 1, the DLL for the PM plugged into 
the base software, customizing it for a given hardware 
device. This meant that if the base required no changes 
an entire PM could be developed independently of the 
rest of the system (the co-development concept) and 
shipped to customers separately from the rest of the 
software (the modularity concept). Finally, much of the
tight coupling in the system was done in the name of
performance optimization. This tight coupling was a
two-edged sword, however, allowing significant
performance gains via detailed optimization on the one
hand but on the other hand stifling the creation of new
algorithms that promised leaps in performance. Here
again we thought that increased autonomy was the key,
as it could increase encapsulation and make it easier to
innovate algorithmically.
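

The base/PM relationship can be sketched in C++ as an abstract
interface owned by the base and implemented by device-specific
code. This is an illustration under assumptions: the class and
function names are hypothetical, the interface is far smaller than
a real one, and the dynamic loading step is omitted.

```cpp
#include <string>
#include <vector>

// Interface owned by the base code (hypothetical names).
class PersonalityModule {
public:
    virtual ~PersonalityModule() = default;
    virtual std::string device_name() const = 0;
    virtual int logic_cells() const = 0;
};

// Device-specific code. In the architecture described above this
// would live in a dynamically loaded library chosen once the base
// has determined the device.
class ExampleDevicePM : public PersonalityModule {
public:
    std::string device_name() const override { return "example-device"; }
    int logic_cells() const override { return 1024; }
};

// The base determines control flow and calls into the PM as needed.
inline std::vector<std::string> run_base_flow(const PersonalityModule& pm) {
    std::vector<std::string> log;
    log.push_back("load " + pm.device_name());
    log.push_back("map " + std::to_string(pm.logic_cells()) + " cells");
    log.push_back("done");
    return log;
}
```

Supporting a new device then means adding another class derived
from PersonalityModule, with no change to run_base_flow.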


A secondary focus was the need for additional sharing. 
As already described in Section 4.2, the combination of 
base and personality module was an ideal situation for 
domain engineering, so there was significant sharing 
because the base code was reused for every device. 
However, there was no natural place in the system for 
algorithms and data structures that did not belong to any 
particular PM, yet did not define a new application for 
the base. Additional sharing of these generic algorithms 
was a secondary consideration for the architectural 
redesign. 


Significant additional improvement was also desired in 
the area of testing. Within the existing architecture, 
anything not in a PM was added to the base, resulting in 
a very large base. Within the base, there were no rules 
about compile-time dependence and several packages 
were involved in compile-time loops. Also, we 
determined that significant gains in testability could be 
achieved by re-factoring the software according to 
function and designing from the outset a system that 
could be tested incrementally in sections. 


Finally, we sought improvement in comprehensibility. 
Because the system had evolved into more than 400 
packages, it was impossible to find a single person who 
had even a cursory understanding of the role of each 
package in the system. This made learning the system 
difficult for new developers, made tracking down 
integration problems difficult, and made it almost 
impossible to consider any large scale decisions about 
system structure. 


5.2 The System Architecture Vision 


Based on our priorities for the key concepts, we began 
to create a vision of where the software should be 
headed to better address autonomy, sharing, testing, and 
comprehensibility. This vision contained several 
elements and extended into the far future. Consequently, 
only two of these elements played a significant role in 
this first iteration through the six-step architecture re- 
design process. These elements were the re-organization 
of the source code into subsystems and layers and the 
creation of a special layer for generic algorithms. This 
section discusses these concepts in further detail. 


Many long hours of discussion went into the creation of 
the vision document. After failing to write anything 
collectively, we delegated the task of an initial vision to 
one committee member. Having this draft allowed us to 
work through many problems and refine the concepts. 
After several iterations we completed an initial draft. 
We further refined the document based on feedback 
from a group of the top engineers in the software 
organization. Finally, the first version of the system 
architecture vision document was published and
presented to all engineers. 


However, the planning work did not stop there. The 
vision encompassed work that could take years to 
complete, but the next release was less than one year 
away. Moreover, with each release we had to support 
the latest hardware devices and offer improvements in 
features, runtime and software quality. After many 
hours of negotiation with marketing, sales, application 
support and senior management we reached a consensus 
on the resources that could be spent on re-architecting 
the software system. Matching available resources 
against the system architecture, we determined that we 
could make two major architectural changes: the 
addition of support for generic algorithms and the re- 
structuring of the software into layers and subsystems. 
Of these, the re-structuring of the software was a 
significantly larger investment. We describe these two 
changes in turn. 


5.2.1 Layers and subsystems 

Re-structuring the system into subsystems and layers 
called for a complete re-organization of the system from 
PMs and packages. It called for the creation of new units 
of functionality called subsystems, which would in turn 
be collected into layers. 


On the surface, the software was still organized into a 
two-level hierarchy. However, the subsystems were 
envisioned as very different from the packages they 
replaced. These differences included the following: 


* A subsystem is typically larger than a package and
can produce multiple libraries or executables. The 
structure within a package was flat, but a subsystem 
is truly hierarchical. 


* Subsystems are logically related pieces of code that 
can have multiple people working on them. A single 
subsystem contains the base code and all the PM 
code for a single application or function. 


* A subsystem provides a single directory that contains
the subsystem interface, i.e., any files exported
by that subsystem, including header files, data files 
and libraries. Other subsystems can access only 
those files explicitly exported. 


* Developers are encouraged to encapsulate and fully 
insulate their subsystem interface. 


* New code in subsystems must follow a naming convention
that limits pollution of the global namespace.


* Subsystems can have a compile-time dependency 
upon another subsystem only if that subsystem is in 
the same layer or a layer listed as a supplier layer. 
The graph of compile-time dependencies within a 
layer must not contain any loops. 


* Part of the subsystem interface is a subsystem definition
document that describes the subsystem and each
exported header file. To avoid synchronization problems,
the header file documentation is generated
from the source code by ccdoc[14].


Figure 2. Include file relationships of Layers as put forward
in the SAV document.


* The interface to a subsystem is controlled and
changes require the approval of a committee. Syn- 
chronization of the changes within subsystems is 
handled differently than synchronization of changes 
in the subsystem interface. Each subsystem has a 
subsystem integrator, a new role that makes sure the 
different engineers working on the subsystem com- 
municate. By having a single person responsible for 
each subsystem we expect to eliminate some, but not 
all, of the final build integration problems. 


Of the changes proposed in the system architecture 
vision, these changes were perhaps the most directly 
connected with OOAD techniques. 


Subsystems are grouped into layers. A layer is a set of 
subsystems with certain compile-time dependency or 
access rules. All subsystems within a layer have to abide 
by the access rules of that layer. As shown in Fig. 2, the 
entire Xilinx software system is composed of ten layers. 
Five layers contain code that originates within Xilinx, 
and the other five layers contain code that originates 
outside of Xilinx. In the figure, the arrows show access. 
For example, of Xilinx subsystems, only those in level 1
can directly include system header files. The principal 
purpose of access rules is to improve testability. 
However, the access rule that limits access to the system 
header files has the added benefit of making it easier to 
port to a new development platform. 
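

The access rule for subsystems (a compile-time dependency is allowed
only within the same layer or on a layer listed as a supplier) can
be expressed as a small check. The sketch below assumes a simple
in-memory description of the layering; the LayerRules type, the
function name, and the layer names used with it are hypothetical.

```cpp
#include <map>
#include <set>
#include <string>

// Assumed description of the layering (not the actual Xilinx format).
struct LayerRules {
    std::map<std::string, std::string> layer_of;             // subsystem -> layer
    std::map<std::string, std::set<std::string>> suppliers;  // layer -> supplier layers
};

// True if `client` may have a compile-time dependency on `supplier`.
inline bool may_depend_on(const LayerRules& rules,
                          const std::string& client,
                          const std::string& supplier) {
    auto c = rules.layer_of.find(client);
    auto s = rules.layer_of.find(supplier);
    if (c == rules.layer_of.end() || s == rules.layer_of.end()) {
        return false;  // unknown subsystem: reject rather than guess
    }
    if (c->second == s->second) return true;  // same layer
    auto allowed = rules.suppliers.find(c->second);
    return allowed != rules.suppliers.end() &&
           allowed->second.count(s->second) > 0;
}
```

A check of this kind could be run over the include directives of
each subsystem to flag violations before they reach a full build.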


5.2.2 Generic Algorithms 
The second change to the software architecture is also 
apparent in Fig. 2. This is the addition of the generic
algorithms layer. The generic algorithms layer
encourages code sharing and reuse for numerical 
algorithms, data structures, and other generic 
algorithms. In addition to the code sharing between base 
and PM code (now contained within a subsystem) the 
intent was that the generic algorithms could be used for 
varied tasks throughout the software. 


5.3 How the Architectural Vision Implements 
the Key Concepts 


Recall that for this first iteration of improvements to our 
software architecture, we focussed on the key concepts 
of autonomy, sharing, testing, and comprehensibility. To
work toward better support for these concepts, we chose 
to implement two major changes from our system 
architecture vision: a generic algorithms layer and the 
re-organization of the system into subsystems and 
layers. In this section we describe how the four key 
concepts on which our re-architecture effort focussed 
were realized through these two major architectural 
changes. 


5.3.1 Autonomy 

Since our engineering organization was split between 
two distinct sites, and a number of remote engineers 
were also involved, we felt that a rearrangement of our 
software along functional lines could help us provide 
better support for autonomous code development. 
Repackaging the software into subsystems created 
modules that are more self-contained and can be worked 
upon independently of other pieces. 


Where modules are dependent on one another, greater 
stability is ensured by encapsulation, insulation, and 
committee approval for interface changes. Although 
some engineers felt these restrictions on the interface 
were overly harsh, we decided the best way to control 
the overall software structure was to control the 
subsystem interfaces. Ideally, we would also guarantee 
that the exported functionality behind the interface was 
also stable. For example, we could try to ensure that the 
return codes for a function do not change. However,
this is impractical, so we rely on engineers to
handle this level of detail. 


Encouraging developers to use encapsulation and 
insulation techniques for their subsystem interface was a 
direct step towards improving those aspects of 
autonomy. As discussed in Section 4.1, complete 
insulation would forbid techniques such as inline
functions, inheritance and default arguments[9]. We 
performed some simple tests and found that in certain 
cases changing a small set of inline functions to be 
called functions could cause a significant run time 
penalty. Since several of our applications can run for 
many hours, this performance penalty was not 
acceptable. Similarly, we have a reliance on inheritance 
that is used as the principal mechanism for writing code 
in dynamically loaded libraries. Because of these 
constraints, we could not require full insulation for all 
subsystem interfaces, but left it initially to each engineer’s
discretion, subject to external review. 
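

The trade-off between insulation and inline functions can be
illustrated with two versions of a small class. This is a sketch
using the opaque-pointer idiom, not code from the Xilinx system;
the class names are hypothetical and std::unique_ptr is used for
brevity.

```cpp
#include <memory>

// Insulated interface: clients see no data members, so a change to
// the implementation does not force them to recompile. Every
// operation is an out-of-line call through a pointer.
class InsulatedCounter {
public:
    InsulatedCounter();
    ~InsulatedCounter();
    void increment();
    int value() const;
private:
    struct Impl;                  // defined only in the implementation
    std::unique_ptr<Impl> impl_;
};

// Uninsulated alternative: inline functions expose the layout to
// every client, but the compiler can remove the call overhead.
class InlineCounter {
public:
    void increment() { ++count_; }
    int value() const { return count_; }
private:
    int count_ = 0;
};

// --- would normally be in the implementation (.cpp) file ---
struct InsulatedCounter::Impl { int count = 0; };
InsulatedCounter::InsulatedCounter() : impl_(new Impl) {}
InsulatedCounter::~InsulatedCounter() = default;
void InsulatedCounter::increment() { ++impl_->count; }
int InsulatedCounter::value() const { return impl_->count; }
```

In a loop that runs for hours, the extra call and indirection in
the insulated version is the kind of cost that made full insulation
unacceptable for some interfaces.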


We further limited compile-time coupling by putting 
access rules on layers. The access rule with the greatest 
impact was that for Xilinx code, only the bottom layer 
could directly include system header files. This made 
designing the bottom layer more difficult because its 
exported files also could not include system header files, 
otherwise the system header files would be included 
indirectly by unknowing clients and insulation would be 
lost. We already had a set of system utilities that 
supported most of the system functions that might be 
needed. By making this rule we could speed up 
compilations and decouple the majority of our software 
from operating system quirks. 
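

A bottom-layer utility that obeys this rule keeps system headers
out of its exported header and confines them to its implementation.
The sketch below assumes a hypothetical utility namespace, sysutil,
with a single function; it is not one of the actual Xilinx system
utilities.

```cpp
// --- exported header of a bottom-layer utility (hypothetical) ---
// Only built-in types appear here, so clients that include this
// file never pull in a system header indirectly.
namespace sysutil {
// Copies the value of environment variable `name` into `buf`.
// Returns false if the variable is unset, an argument is invalid,
// or the value (plus terminator) does not fit in `buf_size` bytes.
bool get_env(const char* name, char* buf, unsigned long buf_size);
}

// --- implementation file: the only place system headers appear ---
#include <cstdlib>
#include <cstring>

namespace sysutil {
bool get_env(const char* name, char* buf, unsigned long buf_size) {
    if (name == nullptr || buf == nullptr || buf_size == 0) return false;
    const char* value = std::getenv(name);
    if (value == nullptr) return false;
    std::size_t len = std::strlen(value);
    if (len + 1 > buf_size) return false;
    std::memcpy(buf, value, len + 1);  // includes the terminator
    return true;
}
}
```

Porting to a new platform then touches only the implementation
file, which is the benefit the access rule is meant to secure.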


Encapsulation in our definition also includes keeping 
the global name space clean. In the absence of 
namespace support on every platform, every exported 
symbol must be unique across the system. This requires 
the creation of a system wide naming convention that 
can be uniformly applied. This also requires restrictions 
on the use of certain compiler features, like #define, that 
can potentially cause conflicts between subsystems. We 
looked for a tool that could be smart enough to help 
automate this clean-up process, but could not find one. 
In the end, we decided that much of what could be done 
in this area was impractical within the time and resource 
constraints of the initial re-architecture into subsystems 
and layers. 


Consequently, although we did create a system wide 
naming convention that applied to all new code, we 
grandfathered existing code. This can lead to 
inconsistent header files that contain classes that follow 
different naming schemes. To compensate, we allowed 
the use of typedefs to make all classes within a 
subsystem consistent with the naming convention, but 
did not require clients of the existing interfaces to 
change. We determined that certain naming conventions 
had to be followed to avoid run time errors and required 
that these be followed, but other than that made few 
changes to the existing names of classes. 
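

The use of typedefs to reconcile grandfathered names with the new
convention can be sketched as follows. The NL_ prefix and the class
shown are hypothetical; the actual Xilinx naming convention is not
given here.

```cpp
// Existing, grandfathered class: its name predates the convention.
class netlist {
public:
    int cell_count() const { return cells_; }
    void add_cell() { ++cells_; }
private:
    int cells_ = 0;
};

// Conforming name (assumed subsystem-prefix convention). Both names
// denote the same type, so existing clients need not change.
typedef netlist NL_Netlist;

// New code is written against the conforming name.
inline int NL_count_after_adding(NL_Netlist& n, int how_many) {
    for (int i = 0; i < how_many; ++i) n.add_cell();
    return n.cell_count();
}
```

Because a typedef introduces an alias rather than a new type, old
and new clients can pass the same objects to each other.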


Despite these few exceptions left until later releases, we 
expect the introduction of subsystems and all they imply 
to lead to substantial improvements in autonomy. 


5.3.2 Sharing 

The introduction of subsystems and layers does little to 
affect code sharing. The base/PM relationship that
implements sharing in the manner of domain 
engineering[16] is still supported by the new 
architecture. Now however, the base and its PM code 
are contained within a single subsystem. 


The creation of the generic algorithms layer is aimed 
directly at improving the other kind of sharing, sharing 
of smaller more general purpose code. In this regard, the 
architecture changes are expected to lead to 
significantly more code sharing of these generic
algorithms. 


5.3.3 Testing 

In the previous Xilinx software architecture, the 
separation of PM code from base code made it difficult 
to independently test the software. This is because the 
PM created a dynamically loaded library that was 
difficult to use in the absence of the base code. The 
concept of subsystems allows us to consolidate more of 
the code together and make the subsystem integrator 
responsible for maintaining and running tests. By 
having a single point of contact for a large set of code it 
was felt we had a better chance of getting a solid 
aliveness testing methodology in place. 


The notion of levelizable software is also directly 
addressed by the architectural changes. Recall that if the 
compile-time dependency graph of a system contains no 
loops, it is said to be levelizable[9]. By organizing the 
subsystems into layers and strictly defining the access 
rules for layers, the system is likely to be levelizable. 
With the additional rule that dependencies between 
subsystems within a given layer cannot cause a 
dependency loop, we can guarantee that the entire 
system is levelizable. 


By factoring the system into layers, we get the 
additional benefit that we can build and test the lowest 
layer first and then on up the dependency graph. 
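

Building and testing from the lowest layer upward corresponds to
assigning each module a level in the dependency graph and
processing levels in ascending order. The following is a minimal
sketch, assuming the same kind of name-to-dependencies map as
before; the DepGraph type and function names are hypothetical.

```cpp
#include <algorithm>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Module name -> modules it depends on at compile time (assumed model).
using DepGraph = std::map<std::string, std::vector<std::string>>;

// Level 1 means no dependencies; otherwise the level is one more
// than the highest level among the module's dependencies.
// Throws std::runtime_error if a dependency loop is found.
inline int level_of(const DepGraph& g, const std::string& node,
                    std::map<std::string, int>& levels) {
    auto known = levels.find(node);
    if (known != levels.end()) {
        if (known->second == 0) {
            throw std::runtime_error("dependency loop at " + node);
        }
        return known->second;
    }
    levels[node] = 0;  // 0 marks "visit in progress"
    int level = 1;
    auto it = g.find(node);
    if (it != g.end()) {
        for (const std::string& dep : it->second) {
            level = std::max(level, 1 + level_of(g, dep, levels));
        }
    }
    levels[node] = level;
    return level;
}

// Computes the level of every module; test order is ascending level.
inline std::map<std::string, int> levelize(const DepGraph& g) {
    std::map<std::string, int> levels;
    for (const auto& entry : g) level_of(g, entry.first, levels);
    return levels;
}
```

A module at level n can be tested once everything below level n has
passed, so a failure points at the newly added module.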


Consequently, the architectural changes lead to testing 
improvements in four areas: allowing the base and PM 
code to be tested together, making the subsystem 
integrator responsible for all subsystem testing, making 
it easy to guarantee a levelizable system, and allowing 
the system to be tested layer by layer. 


5.3.4 Comprehensibility

A final contribution of repackaging the software into
subsystems is the ability to provide a consistent and
easy to use mechanism to learn about and understand
our software. As our software grew we found it
increasingly difficult to prepare developers to write new
code. The learning curve was steep due to the lack of
accurate internal documentation. Clearly more
documentation was required, but large detailed
documents are often imposing to new developers and
poorly maintained by their authors, making their value
dubious at best. What was needed was a simpler
approach.


To this end, each subsystem was required to provide a 
subsystem definition document. This document is 
created and maintained by the subsystem integrator in a 
standard HTML format. The document is not 
comprehensive because it does not deal with subsystem 
implementation. Instead, it briefly describes the purpose 
of the subsystem, the files and libraries it produces, and
its exported interface. Further documentation of the 
exported interface is a set of HTML documents that are 
generated directly from the exported interface header
files using ccdoc[14]. In this manner a new or existing 
client of a subsystem can find an overview of the 
subsystem in the subsystem definition document, and 
detailed interface information in the header 
documentation. 


We did not require that the subsystem implementation 
be described in an exported document, both because 
such a document would quickly become out of date and 
because it was “proprietary” knowledge not required by 
a subsystem’s clients. To engineers who typically 
looked at a function’s implementation before deciding 
whether it was what they wanted, this level of 
encapsulation and documentation was a revolutionary 
concept. 


5.4 The Evolutionary Plan 

After reaching consensus across the organization that 
the architecture would be reorganized into subsystems 
and layers, the final step was to schedule each of the 
packages to be converted into its corresponding 
subsystem. This task was made significantly more 
complex by the need to perform feature and device 
support work in the same time frame. In the past, Xilinx 
had tried to undertake major restructuring of its 
software, only to either fail or to wait many months 
before anything worked again. With the tight constraints 
of the release, neither of these risks was acceptable. 
Consequently we decided to convert the packages into 
subsystems in waves, changing only a fraction of the 
packages at a time. The exit criterion for each wave was 
defined to be working software that passed our internal 
engineering system test suite. The advantage of this 
process was that we would have working software after 
each wave. The disadvantage was that engineers had to 
create a different software environment for each wave, 
each with a different mix of old (packages) and new 
(subsystems). In order to mitigate risk, the plan 
introduced inefficiency by requiring almost everything 
in the system to change for every wave. 


To prepare for each wave, the new subsystems were 
created several weeks in advance of the wave in which 
they were first used. This provided each subsystem with 
a trial period where it could be used locally but was not 
required for a wave to complete. To aid this 
mechanism we instituted a nightly build process where 
all the software released that day was built that night. 
These nightly builds gave us the chance to release a 
subsystem in one wave and then attempt to use it 
without affecting all existing clients in a full build. 


In this section we discussed the changes to the software 
architecture and how those changes reflected the key 
concepts and goals for the re-architecture effort. We 
began by evaluating the current architecture against the 
goals and key concepts. We then chose the key concepts 
of autonomy, sharing, testing, and comprehensibility to 
be given first priority in the redesign effort. Based on 
the highest priority key concepts we proposed several
different architectural solutions intended to address
them and collected the best features of these proposals 
into a coherent document called the system architecture 
vision. We selected two of the ideas from the vision, 
deciding to create a generic algorithm layer and to re- 
factor the system into subsystems and layers. Finally, 
we created a plan to evolve from the starting point of 
our existing software architecture with limited risk. This 
plan called for the move from packages to subsystems to 
take place in several waves, ensuring that the system 
still functioned after each wave was complete. 


In the following sections, we discuss the
implementation of this plan and evaluate the
effectiveness of our efforts to date.

6. Implementation 

As of this writing, the initial transition to layers and 
subsystems is complete, and developers have been 
working with the new architecture for a few months. 
The wave plan succeeded initially, but after 3 waves, 
developers rebelled because the waves required a series 
of changes that needed to be re-visited for every wave. 
As a result, the final wave was more of a tidal wave that
swept in all remaining changes. At that point, everyone 
understood the process well enough that we felt the risk 
involved in such a large change was justified by the time 
saved. 


The major implementation hurdles to date have been 
more related to people and group dynamics than to 
technical issues. Initially, many engineers were 
somewhat confused as to what was actually happening, 
often because they were uninterested or too busy with 
other issues to take the time to really understand the 
process. Instead of riding the waves of change, unaware 
developers were hit by them, suddenly discovering that 
all their suppliers had new interfaces. Nevertheless, 
most of the work has been completed and we are not 
significantly behind schedule. 


7. Results and Evaluation 

As we said in Section 1, the re-architecture process is 
ongoing and will continue for many years as the 
software continues to change. However, we can begin to 
evaluate the initial two changes to the software 
architecture: creating a generic algorithms layer and re- 
factoring into subsystems and layers. These tasks are 
themselves incomplete, but preliminary results are 
encouraging. We detail our intermediate evaluation in 
terms of the key concepts these changes are intended to 
affect most. 


7.1 Autonomy 

At the time of this writing, it appears that there has been 
significant improvement in the ability of our engineers 
to work autonomously on their code. By separating the 
interface of a subsystem from its implementation and by 
placing controls on that interface, we have seen far 
fewer integration problems. Requiring engineers to 
obtain approval for interface changes is cumbersome,
but it makes engineers consider the impact of those
changes.


Table 1: Reduction in included files for a typical set of files

File (Level - see Fig. 2) | Num. Included Files Before | Num. Included Files After | Before/After
A (level 2) | | |
B (level 3) | | |
C (level 4) | | |
D (level 5) | | |
E (level 5) | | |
F (level 5) | | |
G (level 5) | | |
H (level 5) | | |
I (level 5) | | |





In terms of insulation, results to date have been positive 
but depend greatly on the amount of time the engineer 
spent re-designing the interface to their subsystem. The 
first engineers to begin implementing their new fully 
insulated classes were greatly excited. After spending
weeks at a time working on a single class and
realizing that time was slipping away, however, they began
to insulate dramatically less. Engineers
with less time often did almost nothing to redesign their 
interface. We have begun a detailed review of each 
interface to guide future work in this area. 


In general, the engineers working on the lowest levels of 
the system (see Fig. 2) started working on their
subsystems while other engineers were still working on 
the previous release. As a result, the most progress was 
made in the most heavily used portions of the system, 
which provides the most leverage to improve autonomy 
and reduce overall compilation times. This effect can be 
seen in Table 1, which shows the reduction in include 
file count for several files selected at random from 
various parts of the system. This is an important metric 
both because the number of included files is a rough 
measure of autonomy of a subsystem (more included 
files indicates less autonomy) and because including 
fewer files usually means faster compilations. These 
files were also selected because they are typical and 
because they have essentially the same functionality in 
the old system and the new. The levels in the table refer 
to the level numbering system shown in Fig. 2. Because 
of the additional work spent on insulating the lowest 
levels of the system, the most dramatic ratios of the 
numbers of included files can be seen for files in levels 
2 and 3. For levels 4 and 5, more files are included 
because the code is at a higher level of abstraction. 
Moreover, most of the included functionality is from 
levels 2, 3, and 4, which have not been as well insulated. 
As a result, the ratio of the numbers of included files is 
not as dramatic. 


Table 2: Reduction in compilation times for modules that 
are essentially unchanged 

[Columns: Module; Compile Time (s) Before; Compile Time (s) After; 
Ratio: Before/After. 
Rows: A (level 2), B (level 3), C (level 5), D (level 5), E (level 5), 
F (level 5).] 





The reduction in the number of included files is also 
reflected in the compilation time of the system. 
Compilation times are difficult to compare, however, 
both because the functionality of any significant subset 
of the system has increased and because computer 
hardware and networks are continually being upgraded. 
Nevertheless, despite increases in functionality, 
compilation time for the complete system has been 
reduced from approximately 22 hours to approximately 
5 hours. (The compilation happens in parallel, but the 
times quoted are the sum of the times from each 
machine.) Consequently, even allowing for hardware 
improvements, it is safe to say that compilation time 
has decreased. 


Improvements in compilation time can also be seen in 
Table 2, which shows compilation times for several 
modules performed under controlled conditions: on a 
200 MHz UltraSPARC machine running Solaris 2.6. 
These modules were selected because they have 
remained relatively unchanged. It is meaningless to 
compare portions of the system where most of the 
improvements were made because the structure and 
function of the code is dramatically different. We can 
only extrapolate from the overall compilation times to 
estimate that the largest compilation improvements are 
in the re-written portions of the system. However, even 
when the module is essentially unchanged, because the 
subsystems on which the module depends have been 
insulated, the compilation time has decreased. 


Although insulation can improve autonomy, it can also 
adversely affect the runtime of the application. In one 
case, several in-line functions were re-introduced into 
an insulated subsystem because of the overhead of the 
function call. Each of these functions was called so 
many times that the function call overhead consumed 
1-2% of the overall runtime of an application that ran 
for several hours. We plan to introduce 
alternative interfaces for clients with such special 
requirements. 


As for encapsulation, the best we can say so far is that 
we have exposed our engineers to the idea of 
encapsulation. A few of the engineers did take this 
concept to heart and completely encapsulated their 
classes. These classes are now benefiting from this work 
in that they can have their class implementation changed 
without affecting their clients. Most engineers went into 
this project assuming that they were going to completely 
encapsulate every class. Once they started this process 
and saw the amount of time it took, they often retreated 
to insulation. This was acceptable because insulation 
provided immediately visible benefits that would 
encourage engineers to return to encapsulation as time 
permits. 


The overall effect of the re-architecture effort on 
autonomy has been dramatic, particularly in the area of 
insulation. The use of encapsulation is more difficult to 
quantify, but we expect it will be more noticeable as the 
system continues to change. 


7.2 Sharing 


Improved sharing via the generic algorithm layer is also 
a long term investment. To date, several algorithms 
have been added to the layer and we have had at least 
one successful re-use. We expect greater utility over 
time, but unlike the optimistic early advocates of 
re-use, we do not expect dramatic results from 
this effort. 


7.3 Testing 


Because we are not yet at that point in the process, we 
have not yet seen reductions in our testing burden or bug 
count. To provide a starting point for improved testing, 
there will be a special build, called a test build, used by 
developers to work on their internal tests. Each 
subsystem integrator will use this build to determine the 
amount of code coverage for their subsystem. Although 
we have tried to increase code coverage in the past, tight 
coupling with other developers’ code was often used as 
an excuse for poor test coverage. We believe that most 
engineers will be horrified by the lack of coverage and 
spend time increasing it. With this initial code coverage 
benchmark for the new software architecture, we can 
require increases in test coverage in future releases. 


7.4 Comprehensibility 


As with the other key concepts, comprehensibility is 
difficult to quantify and results have been mixed. In 
terms of documentation, we have created an on-line 
internal tools documentation area that contains the 
subsystem definition document for each subsystem. We 
have installed the ccdoc tool that creates documentation 
from C++ header files. Developers need only provide a 
specified type of comment in their interface header files 
and these will be added to the generated documentation. 
Even with all of this working, we hear anecdotally that 
not many people are using the browsing facilities. This 
is likely because most engineers have spent years 
reading header files directly and are still most 
comfortable doing so. Perhaps with the arrival of new 
engineers, this new browsing facility will become more 
useful. Similarly, as we begin to move toward a 
heterogeneous mix of C++ and Java, we might find these 
facilities more widely used. 


Figure 3. Topologically sorted graph of package include dependencies (before re-architecture). 


One kind of comprehensibility where improvement is 
more readily apparent is the documentation of the 
overall architecture of the system. Drawings such as 
Fig. 2 provide a simple, high-level picture of the 
intended architecture. A complete drawing of all the 
subsystems and their include dependencies, shown in 
Fig. 4, provides a more complete view of the system 
architecture. This graph is topologically sorted, which 
reveals that the compilation order of the system can be 
mapped back to the layers of Fig. 2. Note also that 
redundant edges (a direct edge to a subsystem included 
indirectly) have been removed. Fig. 4 also shows two 
remaining cyclic dependencies in the compilation order 
(upper right). These are in an isolated part of the system 
and will be eliminated soon. 
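

A graph of this kind can be ordered mechanically. The following C++ sketch is 
an assumption about how such a tool could work, not the tool used to draw the 
figures. It applies Kahn's algorithm to a map from each subsystem to the 
subsystems it includes, returns a compilation order, and reports the 
subsystems that cannot be ordered because of a cycle. All names are 
hypothetical. 

```cpp
#include <map>
#include <queue>
#include <set>
#include <string>
#include <vector>

// subsystem -> set of subsystems whose headers it includes
using Graph = std::map<std::string, std::set<std::string>>;

struct SortResult {
    std::vector<std::string> order;   // compilation order, dependencies first
    std::vector<std::string> cyclic;  // subsystems on a cycle, or depending on one
};

SortResult topoSort(const Graph &includes) {
    std::map<std::string, int> pending;                     // unresolved includes
    std::map<std::string, std::vector<std::string>> users;  // reverse edges
    for (const auto &entry : includes) {
        pending[entry.first];                // register the subsystem itself
        for (const auto &dep : entry.second) {
            pending[dep];                    // register the included subsystem
            ++pending[entry.first];
            users[dep].push_back(entry.first);
        }
    }

    std::queue<std::string> ready;           // subsystems with nothing left to wait for
    for (const auto &p : pending)
        if (p.second == 0) ready.push(p.first);

    SortResult result;
    while (!ready.empty()) {
        const std::string node = ready.front();
        ready.pop();
        result.order.push_back(node);
        for (const auto &user : users[node])
            if (--pending[user] == 0) ready.push(user);
    }
    for (const auto &p : pending)
        if (p.second > 0) result.cyclic.push_back(p.first);
    return result;
}
```

An empty cyclic list means that the include graph is layered; any entries 
point at the dependencies that still have to be broken. 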


Comparing Fig. 4 with Fig. 3, which was generated in 
the identical fashion from the previous architecture, it is 
quite apparent that the new architecture is easier to 
comprehend and work with at this level of abstraction. 


8. Conclusions 

In this paper, we introduced a six-step process by which 
Xilinx has made an initial iteration at re-architecting its 
software system. We have followed the progression 
through these steps and discussed the flow from 
analysis, to goals, to key concepts, to planning, to 
implementation and finally to evaluation. We have 
shown that we can modify a large legacy software 
system and create a software architecture that better 
balances among key concepts that reflect the demands 
of business, requirements of software engineering, and 
OOAD principles. Although this first iteration of our re- 
architecture is not yet complete, already we have seen 
significant gains in developer productivity. 


Figure 4. Topologically sorted graph of subsystem include 
dependencies (after re-architecture). Levels refer to Fig. 2. 


Abstract 


Recent developments in component technology en- 
able the construction of complex software systems 
by assembling off-the-shelf components. 
However, it is still difficult to develop efficient, 
reliable, and dynamically configurable component- 
based systems. Components are often developed by 
different groups with different methodologies. Un- 
specified dependencies and behavior lead to unex- 
pected failures. 


Component-based software systems must maintain 
explicit representations of inter-component depen- 
dence and component requirements. This provides 
a common ground for supporting fault-tolerance and 
automating dynamic configuration. 


In this paper, we present a generic model for reify- 
ing dependencies in distributed component systems 
and discuss how it can be used to support automatic 
configuration. We describe our experience deploy- 
ing the framework in a CORBA-compliant reflective 
ORB and discuss the use of this model in a new dis- 
tributed operating system. 


1 Introduction 


Research on object-oriented technology and its in- 
tensive use by the industry has led to the develop- 
ment of component-oriented programming. Rather 
than being an alternative to object-orientation, 
component technology extends the initial concepts 
of objects. It stresses the desire for independent 
pieces of software that can be reused and combined 
in different ways to implement complex software sys- 
tems. 


Recently developed component architectures 
[Ham97, Den97, OMG97] support the construction 
of sophisticated systems by assembling together 
a collection of off-the-shelf software components 
with the help of visual tools or programmatic 
interfaces. However, there is still very little support 
for managing the interactions between components. 
Components are created by different programmers, 
often working in different groups with different 
methodologies. It is hard to create robust and 
efficient systems if the dynamic dependencies 
between components are not well understood. It 
is very common to find cases, in both legacy and 
component-based systems, in which a module 
fails to accomplish its goal because an unspecified 
dependency is not properly resolved. Sometimes, 
the graceful failure of one module is not properly 
detected by other modules leading to system failure. 


A similar problem can be detected in a different con- 
text. Current systems are continuously being up- 
dated and modified. For example, system adminis- 
trators working on UNIX or Windows NT environ- 
ments must be aware of security announcements on 
a daily basis and be prepared to update the oper- 
ating system kernel with security patches. In addi- 
tion, users demand new versions of applications such 
as web browsers, text editors, software development 
tools, and the like. Often, building and installing a 
new package requires that a series of other tools be 
updated. 


Users of workstations and personal computers are 
also not free from the burden of system or account 
maintenance. In environments like MS-Windows, 
the installation of some applications is partially au- 
tomated by “wizard” interfaces that direct the 
user through the installation process. However, it 
is common to face situations in which the installa- 
tion cannot complete or in which it completes but 
the software package does not run properly because 
some of its (unspecified) requirements are not met. 
In other cases, after installing a new version of a 
system component or a new tool, applications that 
used to work before the update stop functioning. It 
is typical that applications on MS-Windows cannot 
be cleanly uninstalled. Often, after executing spe- 
cial uninstall procedures, “junk” libraries and files 
are left in the system. 


The problem behind all these difficulties is the lack 
of a model for representing the dependencies among 
system and application components and mecha- 
nisms for managing these dependencies. 


We argue that operating system and middleware 
environments must provide support for represent- 
ing the dependencies among software components 
in an explicit way. This representation can then be 
manipulated in order to implement software com- 
ponents that are able to configure themselves and 
adapt to ever changing dynamic environments. 


By reifying the interactions between system and ap- 
plication components, system software can recog- 
nize the need for reconfiguration to better support 
fault-tolerance, security, quality of service, and op- 
timizations. In addition, it gains the means to carry 
out this reconfiguration without compromising sys- 
tem stability and reliability and with minimal im- 
pact on performance. 


Our research builds on previous and ongoing work 
on software architecture [SG96], dynamic recon- 
figuration of distributed systems [HWP93, Hof94, 
SW98], and quality of service specification [FK98, 
LBS+98]. Our long-term goal is to develop a generic 
model for automatic configuration that can be ap- 
plied to modern component architectures. 


1.1 Paper Contents 


The initial objective of our research is the support 
for representing dependencies among software com- 
ponents in an explicit way. With that support, we 
develop mechanisms that utilize this representation 
to perform automatic (re)configuration of software 
components in dynamic environments. 


This paper describes our model for representing 
component prerequisites (section 2.1) and runtime 
inter-component dependence (section 2.2). Al- 
though we describe the implementation of a frame- 
work for reifying inter-component dependence, the 
details about the implementation of prerequisites 
are out of the scope of this paper and will be ad- 
dressed in a future document. 


Section 3 presents two application scenarios: section 
3.1 describes our experience using the framework to 
support on-the-fly reconfiguration of dynamicTAO, 
a reflective CORBA-compliant ORB and section 3.2 
discusses the use of our model in the 2K distributed 
operating system. 


After discussing related work in section 4, we de- 
scribe our plans for the future in section 5 and 
present our conclusions in section 6. 


2 Inter-Component Dependence 


To address the problems described in the previous 
section, a configuration system must explore two 
distinct kinds of dependencies: 


1. Requirements for loading an inert component 
into the runtime system (called prerequisites). 


2. Dynamic dependencies among loaded compo- 
nents in a running system. 


As long as the system knows exactly what the re- 
quirements are for installing and running a soft- 
ware component, the installation and configuration 
of new components can be automated. As a byprod- 
uct of this knowledge, component performance can 
be improved by analyzing the dynamic state of sys- 
tem resources, analyzing the characteristics of each 
component, and by configuring them in the most 
efficient way. 


Also, if the system knows what the dynamic depen- 
dencies among running components are, it can (1) 
better handle exceptional behavior that could po- 
tentially trouble component operation, and (2) sup- 
port dynamic reconfigurations of large systems by 
replacing individual components on-the-fly. 


Prerequisites and runtime dependencies are two dis- 
tinct forms of the same entity. Prerequisites usually 
are expressed as dependencies on “persistent” hard- 
ware and software components while runtime de- 
pendencies refer to dynamic, possibly volatile, com- 
ponents. In particular, if one freezes a component’s 
state (including its runtime dependencies) and stops 
it, one could later resume its execution by using 
the frozen runtime dependencies as the prerequi- 
sites for reloading the component. However, in or- 
der to make the model as clear as possible, we are 
going to treat prerequisites and runtime dependen- 
cies as separate entities. Prerequisites usually refer 
to hardware resources, QoS requirements, and soft- 
ware services. Runtime dependencies refer to loaded 
software components. Thus, we believe that the sep- 
aration of concepts is justifiable. In the future, after 
the basic problems are solved, we may consider 
unifying these concepts in order to build a simpler and 
more generic model. 


2.1 Prerequisites 


The prerequisites for a particular inert component 
must specify any special requirement for properly 
loading, configuring, and executing that component. 
We consider three different kinds of information that 
can be contained in a list of prerequisites. 


1. The nature of the hardware resources the com- 
ponent needs. 


2. The capacity of the hardware resources it 
needs. 


3. The software services (i.e., components) it re- 
quires. 


The first two items may be used by a distributed 
Resource Management Service to determine where, 
how, and when to execute the component. QoS- 
aware systems can use these data to enable proper 
admission control, resource negotiation, and re- 
source reservation. The last item is the one which 
determines which auxiliary components must be 
loaded and in which kind of software environment 
they will execute. 


The first two items can be expressed by QoS specifi- 
cation languages [FK98, LBS+98]. The third item is 
equivalent to the requires clause in module intercon- 
nection languages like, for instance, the one used in 
Polylith [Pur94]. We are in the process of analyzing 
existing specification languages to study which ones 
would best fit our needs. The language must al- 
low processing specifications at execution time with 
little overhead. We will deploy initial prototypes 
in 2K, a new CORBA-based distributed operating 
system [KSC+98, CNM98] currently under develop- 
ment. The main purpose of this paper, however, is 
to describe the design and implementation of the in- 
frastructure for representing runtime dependencies 
presented next. 


2.2 Dynamic Dependencies 


In our model, each component is managed by a com- 
ponent configurator which is responsible for storing 
the dependencies between a specific component and 
other system and application components. 


Depending on the way it is implemented, a compo- 
nent configurator may be able to refer to compo- 
nents running on a single address space, on differ- 
ent address spaces and processes, or even running 
on different machines in a distributed system. Fig- 
ure 1 depicts the dependencies that a component 
configurator reifies. 


Each component C has a set of hooks to which other 
components can be attached. These are the com- 
ponents on which C depends and are called hooked 
components. There might be other components that 
depend on C; these are called clients. In general, 
each time one defines that a component C1 depends 
on a component C2, the system should perform two 
actions: 


1. attach C2 to one of the hooks in C1; and 


2. add C1 to the list of clients of C2. 
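

The two actions can be sketched in C++ as follows. This is a minimal 
illustration under simplifying assumptions (a single address space, raw 
pointers, a status code as the only error handling). The Configurator 
structure and the helper addDependency are hypothetical stand-ins, not the 
framework's actual code. 

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical, stripped-down configurator: hooks map a hook name to the
// component attached there; clients are the components that depend on us.
struct Configurator {
    std::string name;
    std::map<std::string, Configurator *> hooks;
    std::set<Configurator *> clients;

    explicit Configurator(const std::string &n) : name(n) {}
};

// Record that c1 depends on c2 through the hook named hookName.
// Returns 0 on success, or -1 if that hook is already occupied.
int addDependency(Configurator &c1, const std::string &hookName,
                  Configurator &c2) {
    if (c1.hooks.count(hookName) != 0) return -1;
    c1.hooks[hookName] = &c2;  // action 1: attach C2 to a hook in C1
    c2.clients.insert(&c1);    // action 2: add C1 to the clients of C2
    return 0;
}
```

Keeping both directions of the relation up to date is what later allows a 
component to notify its clients and its hooked components when its state 
changes. 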


As an example, consider a web browser that spec- 
ifies, in its list of prerequisites, that it requires a 
TCP/IP service, a window manager, and a local file 
service. Its component configurator should main- 
tain a hook for each of these services. When the 
browser is loaded, the system must verify whether 
these services are available in the local environment. 
If they are not, it must create new instances of them. 
In any case, references to the services are stored in 
the browser configurator hooks and may be later 
retrieved and updated if necessary. 


Figure 1: Reification of component dependence. 
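

The loading step in the web browser example can be sketched as follows. The 
sketch assumes that prerequisites are plain service type names and that the 
local environment is a simple in-memory table; Service, Environment, and 
loadComponent are hypothetical names introduced only for illustration. 

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

struct Service {       // hypothetical stand-in for a loaded service
    std::string type;
};

// Hypothetical local environment: the services currently running, by type.
struct Environment {
    std::map<std::string, std::shared_ptr<Service>> running;
    int created = 0;   // number of services that had to be started on demand

    // Return the running instance of the given type, creating one if needed.
    std::shared_ptr<Service> resolve(const std::string &type) {
        auto it = running.find(type);
        if (it != running.end()) return it->second;
        auto svc = std::make_shared<Service>(Service{type});
        running[type] = svc;
        ++created;
        return svc;
    }
};

// Fill one hook per prerequisite and return the resulting hook table.
std::map<std::string, std::shared_ptr<Service>>
loadComponent(const std::vector<std::string> &prerequisites, Environment &env) {
    std::map<std::string, std::shared_ptr<Service>> hooks;
    for (const auto &type : prerequisites) hooks[type] = env.resolve(type);
    return hooks;
}
```

Because the hook table holds references and not copies, a hook can later be 
pointed at a different instance without reloading the component. 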


2.2.1 The ComponentConfigurator class 


The reification of runtime dependencies is accom- 
plished by assigning one ComponentConfigurator ob- 
ject to each component. A simplified declaration 
of the ComponentConfigurator class in pseudo-C++ 
follows. Figure 2 shows a schematic representation 
of some of its method calls. 


The class constructor receives a pointer to the 
component implementation as a parameter. It 
can be later obtained through the implementation() 
method. 


The hook() method is used to specify that this com- 
ponent depends upon another component and un- 
hook() breaks this dependence. The registerClient() 
and unregisterClient() methods are similar to hook() 
and unhook() but they specify that other compo- 
nents (called clients) depend upon this component. 


class ComponentConfigurator { 
public: 
  ComponentConfigurator (Object *implementation); 
  ~ComponentConfigurator (); 

  int hook (const char *hookName, 
            ComponentConfigurator *component); 
  int unhook (const char *hookName); 
  int registerClient 
        (ComponentConfigurator *client, 
         const char *hookNameInClient = NULL); 
  int unregisterClient 
        (ComponentConfigurator *client); 

  int eventOnHookedComponent 
        (ComponentConfigurator *hookedComponent, 
         Event e); 
  int eventOnClient 
        (ComponentConfigurator *client, 
         Event e); 

  char *name (); 
  char *info (); 
  DependencyList *listHooks (); 
  DependencyList *listClients (); 
  ComponentConfigurator * 
    getHookedComponent (const char *hookName); 

  Object *implementation (); 
}; 
[Figure 2 diagram: a COMPONENT, its HOOKED COMPONENTS, and its CLIENTS, 
linked by "depends on" arrows and annotated with eventOnHookedComponent(), 
eventOnClient(), and (un)registerClient().] 


Figure 2: Methods for specifying dependencies and 
sending events. 


eventOnHookedComponent() announces that a com- 
ponent which is attached to this component has gen- 
erated an event. The ComponentConfigurator class 
is subclassed to implement different behaviors when 
events are reported. Examples of common events 
are the destruction of a hooked component, the in- 
ternal reconfiguration of a hooked component, or 
the replacement of the implementation of a hooked 
component. 


eventOnClient() is similar to the previous method 
but it announces that a client has generated an 
event. This can be used, for example, to trig- 
ger reconfigurations in a component to adapt to 
new conditions in its clients. Our reference im- 
plementation defines a basic set of events including 
DELETED, FAILED, RECONFIGURED, REPLACED, 
and MIGRATED. Applications can extend this set by 
defining their own events. 
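

How a subclass might react to such events can be sketched as follows. The 
event names are those listed above; everything else (the reduced Configurator 
base class, the CacheConfigurator subclass, and its log) is a hypothetical 
illustration and not part of the reference implementation. 

```cpp
#include <string>
#include <vector>

// Basic events named in the text; an application could extend this set.
enum class Event { DELETED, FAILED, RECONFIGURED, REPLACED, MIGRATED };

// Hypothetical base class whose default handlers ignore every event.
class Configurator {
public:
    virtual ~Configurator() = default;
    virtual int eventOnHookedComponent(Configurator *source, Event e) {
        (void)source; (void)e;
        return 0;
    }
    virtual int eventOnClient(Configurator *source, Event e) {
        (void)source; (void)e;
        return 0;
    }
};

// Hypothetical subclass for a component with a single hook named backend.
class CacheConfigurator : public Configurator {
public:
    Configurator *backend = nullptr;
    std::vector<std::string> log;

    int eventOnHookedComponent(Configurator *source, Event e) override {
        if (source == backend && e == Event::DELETED) {
            backend = nullptr;  // never call the deleted component again
            log.push_back("backend deleted, running without it");
        } else if (e == Event::RECONFIGURED) {
            log.push_back("backend reconfigured, flushing cached state");
        }
        return 0;
    }
};
```

Events that a subclass does not care about simply fall through without any 
effect, so new event types can be added without touching existing 
subclasses. 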


name() returns a pointer to a string containing the 
name of the component and info() returns a pointer 
to a string containing a description of the compo- 
nent. Specific info() implementations can return dif- 
ferent kinds of information like a list of configuration 
options accepted by the component, or a URL for 
its documentation and source code. 


listHooks() returns a pointer to a list of Depen- 
dencySpecifications. A DependencySpecification is a 
structure defined as 


struct DependencySpecification { 
  const char *hookName; 
  ComponentConfigurator *component; 
}; 





listClients() returns a pointer to a list of Dependen- 
cySpecifications corresponding to the components 
that depend on this component (its clients) and the 
name of the hooks (in the client’s ComponentCon- 
figurator) to which this component is attached. 


Finally, getHookedComponent() returns a pointer to 
the configurator of the component that is attached 
to a given hook. 


2.2.2 Towards Automatic Configuration 


As discussed above, reified inter-component depen- 
dence can help the automation of configuration pro- 
cesses. By scanning the list of prerequisites, the op- 
erating system or middleware can be certain that 
all hardware and software requirements for the ex- 
ecution of a particular component are met before it 
is initiated. This can avoid a large number of prob- 
lems that are common in existing systems where the 


lack of a particular component or resource is only 
detected after the application is running. 


The dynamic dependence information, in turn, 
enables the reconfiguration of components that are 
already running. In addition, it provides impor- 
tant information for implementing fault-tolerance 
and smooth exception handling in an environment 
of centralized or distributed components. 


As an example, consider the deletion of a component 
containing our ComponentConfigurator class. Differ- 
ent policies for dealing with component deletion can 
be adopted. In general, when a component C is de- 
stroyed, an announcement must be made to com- 
ponents that depend on C and to components on 
which C depends. The following piece of pseudo- 
C++ code illustrates this process with a conserva- 
tive implementation of the ComponentConfigurator 
destructor. 


ComponentConfigurator::~ComponentConfigurator () 
{ 
  for (c in hookedComponents) { 
    c.configurator->unregisterClient (this); 
  } 

  for (c in clients) { 
    c.configurator-> 
      eventOnHookedComponent (this, 
                              DELETED); 
  } 

  // delete list of hooks and hookedComponents 
  // delete list of clients 
  // release resources 
  // delete component implementation 
} // ~ComponentConfigurator () 
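

For readers who want to experiment with this policy, the following is one 
possible compilable rendering of the same destructor. It assumes a single 
address space, standard containers, and a reduced interface; the member names 
hooks_ and clients_, the clientCount() helper, and the default event handler 
are illustrative choices and are not taken from the reference implementation. 

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>

enum Event { DELETED };

class ComponentConfigurator {
public:
    ComponentConfigurator() = default;
    virtual ~ComponentConfigurator();

    int hook(const std::string &hookName, ComponentConfigurator *component) {
        hooks_[hookName] = component;
        return component->registerClient(this);
    }
    int registerClient(ComponentConfigurator *c) { clients_.insert(c); return 0; }
    int unregisterClient(ComponentConfigurator *c) { clients_.erase(c); return 0; }

    // Default reaction: clear every hook that pointed at the deleted sender.
    virtual int eventOnHookedComponent(ComponentConfigurator *hooked, Event e) {
        if (e == DELETED)
            for (auto &h : hooks_)
                if (h.second == hooked) h.second = nullptr;
        return 0;
    }
    ComponentConfigurator *getHookedComponent(const std::string &hookName) const {
        auto it = hooks_.find(hookName);
        return it == hooks_.end() ? nullptr : it->second;
    }
    std::size_t clientCount() const { return clients_.size(); }

private:
    std::map<std::string, ComponentConfigurator *> hooks_;
    std::set<ComponentConfigurator *> clients_;
};

ComponentConfigurator::~ComponentConfigurator() {
    // Tell the components we depend on that we are no longer their client.
    for (auto &h : hooks_)
        if (h.second != nullptr) h.second->unregisterClient(this);
    // Tell our clients that the component they hooked is going away.
    // Iterate over a copy in case a handler modifies clients_.
    const std::set<ComponentConfigurator *> clients = clients_;
    for (ComponentConfigurator *c : clients)
        c->eventOnHookedComponent(this, DELETED);
    // The containers release their own storage when the members are destroyed.
}
```

All calls in this sketch are synchronous, so it shares the blocking behavior 
of the conservative pseudo-code version. 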





Implementations of this destructor can be special- 
ized to adjust its behavior to different component 
types and to meet special requirements. Also, dif- 
ferent component types must implement methods 
such as eventOnHookedComponent() in proper ways 
to take care of the different kinds of dependencies. 
In an extreme case, deleting a component will cause 
all components that depend on it to be deleted. In 
the other extreme case, these other components will 
only be notified and nothing else will change. In 
most cases, we expect that these components 
will try to reconfigure themselves in order to deal 
with the loss of one of their dependencies. 


The problem with this implementation is that the 
complete destruction of the component only takes 
place if all the method calls to hooked components 
and clients return. If any of these calls blocks, the 
component is not deleted. This problem is particu- 
larly important if some of the clients decide to ini- 
tiate their own destruction as a result of the call 
to eventOnHookedComponent() and a long chain of 
calls is established. 


A naive solution to this problem could be to execute 
the method calls asynchronously, for example, by 
creating new threads to perform the calls. This so- 
lution would incur the additional cost of creating 
new threads and could lead to dangerous situations 
as a C++ component could try to call a method on 
another component after the latter is destroyed. 


Thus, it seems that we are trapped between a safe, 
conservative solution that might block indefinitely 
and a liberal but unsafe solution that may crash 
the whole system by executing invalid code. We 
have been studying this problem and, in [KC98], we 
discuss solutions that lie somewhere between these 
two extremes. They are as safe as the conservative 
one but are less subject to blocking. 


2.2.3 Managing Dependencies 


The use of our model in a language like C++ re- 
quires strict collaboration from the component de- 
veloper to conform to proposed guidelines. It is also 
important that all the communication between com- 
ponents be done through controlled interfaces. In 
order to avoid a proliferation of programming errors 
related to dependence reification, it would be nec- 
essary to develop special languages, compilers, and 
runtime systems to guarantee the safety of compo- 
nent execution and reconfiguration. 


A cleaner solution would be to use existing reflec- 
tive languages and environments. Iguana [GC96] 
and OpenC++ [Chi95], for example, are extensions 
to C++ that reify several features of this language, 
allowing dynamic modification of their implementa- 
tions. In these languages, it would be possible to 
instrument method invocation to take care of de- 
pendence maintenance. 


However, a major goal of our research is not to limit 
the implementation to a particular programming 
language and only use widely accepted standards. 
We could also tie together the mechanisms for com- 
munication and dependence representation using, 
for example, abstract connectors [SDZ96]. But this 
could limit the expressiveness of the model. Our 
objective is to develop a generic methodology that 
could be utilized in a large number of heterogeneous 
environments. These requirements can only be met 
by using a standard architecture like CORBA. 


2.3 CORBA ComponentConfigurator 


CORBA permits the integration of components 
written in different programming languages on het- 
erogeneous environments. In addition, CORBA’s 
(remote) method invocation mechanism can be de- 
coupled from the base language method call. Thus, 
it is possible to guarantee that bad CORBA ref- 
erences are not translated into bad base language 
references (like dangling C++ pointers, for exam- 
ple). Instead, exceptions are neatly handled by the 
runtime and the application is informed of their oc- 
currence. 


In the CORBA implementation of our model, a De- 
pendencySpecification stores a CORBA Interopera- 
ble Object Reference (IOR) so that the Component- 
Configurator is able to reify dependencies among 
distributed components. Prerequisites for software 
components can be specified either in terms of per- 
sistent IORs [Hen98] or in terms of service type and 
attributes. In the former case, an implementation 
repository can be used to dynamically create a new 
CORBA object if one is not available. In the latter 
case, the CORBA Trading Object Service [OMG98] 
can be used to locate an instance of the server com- 
ponent that meets the requirements specified by the 
given attributes. 


When a CORBA component is destroyed, the com- 
ponent implementation (or the ORB) must call the 
configurator destructor so that it can tell its clients 
that the destruction is taking place. If a node 
crashes or if the whole process containing both the 
component and the configurator crashes, it might not 
be possible to execute the configurator destructor. 
In this case, the clients will not be informed of the 
component destruction. Subsequent CORBA invo- 
cations to the crashed component will raise an ex- 
ception announcing that the object is not reachable 
or that it does not exist. In this case, it is the re- 
sponsibility of the client component to locate a new 
server component and update its ComponentConfig- 
urator. 
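

The client-side recovery described above can be sketched as follows. The 
sketch deliberately avoids any real ORB API: ObjectNotReachable stands in for 
the exception raised by the CORBA runtime, ServerRef stands in for an object 
reference stored in a hook, and the locate callback stands in for whatever 
mechanism the client uses to find a new server. All of these names are 
hypothetical. 

```cpp
#include <functional>
#include <stdexcept>
#include <string>

// Stand-in for the exception raised when the target object is
// unreachable or no longer exists (hypothetical type).
struct ObjectNotReachable : std::runtime_error {
    ObjectNotReachable() : std::runtime_error("object not reachable") {}
};

// Stand-in for a remote object reference held in one hook.
struct ServerRef {
    std::string id;
    bool alive;

    std::string request(const std::string &msg) const {
        if (!alive) throw ObjectNotReachable();
        return id + ":" + msg;
    }
};

// Invoke the hooked server. On failure, ask locate for a replacement,
// store it in the hook, and retry once.
std::string invokeWithRecovery(ServerRef &hook, const std::string &msg,
                               const std::function<ServerRef()> &locate) {
    try {
        return hook.request(msg);
    } catch (const ObjectNotReachable &) {
        hook = locate();           // update the hook in the client's configurator
        return hook.request(msg);  // a second failure propagates to the caller
    }
}
```

Retrying only once keeps the sketch simple; a client with stricter 
availability requirements would need its own policy for repeated failures. 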


As future work, we intend to perform experiments 
with the different ways of using the CORBA Com- 
ponentConfigurator to manage distributed applica- 
tions. In particular, component configurators can 
be (1) co-located with their respective component 
implementations, (2) located in a separate process 
in the same machine or (3) located in a centralized 
node on the network while the component imple- 
mentations are distributed. We will investigate the 
benefits of the different approaches. 


2.4 Implementation Status 


We have implemented prototypes of the Component- 
Configurator for centralized applications in C++ 
and Java. The C++ implementation was deployed 
in the dynamicTAO ORB as described in section 
3.1. We have recently completed an implementa- 
tion of distributed ComponentConfigurators based 
on CORBA. 


We plan to extend the Java implementation to sup- 
port Java Bean components and distributed object 
communication with Java RMI. We will, then, work 
on the interoperability among different implemen- 
tations of the model in different component archi- 
tectures. 


3 Application Scenarios 


This section describes the deployment of the Compo- 
nentConfigurator framework in dynamicTAO, a re- 
flective Object Request Broker. It illustrates how 
our model can be used to represent and manipu- 
late the internal structure of a legacy system, en- 
abling dynamic reconfiguration. We, then, discuss 
how this framework will be used to support archi- 
tectural awareness in the 2K distributed operating 
system. 


3.1 dynamicTAO 


One of the major constituent elements of 2K, a dis- 
tributed operating system our group is developing 
[KSC+98, CNM98], is a reflective middleware layer 
based on CORBA. After carefully studying existing 
Object Request Brokers, we came to the conclusion 
that the TAO ORB [SC99] would be the best starting 
point for developing our infrastructure. TAO 
is a portable, flexible, extensible, and configurable 
ORB based on object-oriented design patterns. It 
uses the Strategy design pattern [GHJV95] to sepa- 
rate different aspects of the ORB internal engine. A 
configuration file is used to specify the strategies the 
ORB uses to implement aspects like concurrency, 
request demultiplexing, scheduling, and connection 
management. At ORB startup time, the configu- 
ration file is parsed and the selected strategies are 
loaded. 


TAO is primarily targeted at static hard real-time 
applications such as avionics systems [HLS97]. 
Thus, it assumes that, once the ORB is initially con- 
figured, its strategies will remain in place until it 
completes its execution. There is very little support 
for on-the-fly reconfiguration. 


The 2K project seeks to build a flexible infrastruc- 
ture to support adaptive applications running on 
dynamic environments. On-the-fly adaptation is 
extremely important for a wide range of applica- 
tions including the ones dealing with multimedia, 
mobile computers, and dynamically changing envi- 
ronments. 


The design of 2K depends on dynamicTAO, an ex- 
tension of TAO that enables on-the-fly reconfigura- 
tion of its strategies. dynamicTAO exports an in- 
terface for loading and unloading modules into the 
ORB runtime, and for inspecting the ORB configu- 
ration state. The architecture can also be used for 
dynamic reconfiguration of servants running on top 
of the ORB and even for reconfiguring non-CORBA 
applications. 


3.1.1 Problems Encountered 


Reconfiguring a running ORB while it is servicing 
client requests is a difficult task that requires care- 
ful consideration. There are two major classes of 
problems. 


Consider the case in which dynamicTAO receives a 
request for replacing one of its strategies (S_old) by a 
new strategy (S_new). The first problem is that, since 
TAO strategies are implemented as C++ objects 
that communicate through method invocations, the 
system must be sure, before unloading S_old, that 
no one is running S_old code and that no one is 
expecting to run S_old code in the future. Otherwise, 
the system could crash. Thus, it is important to 
ensure that S_old is only unloaded after the system can 
guarantee that its code will not be called. 


The second problem is that some strategies need 
to keep state information. When a strategy S_old is 
being replaced by S_new, part of S_old's internal state 
may need to be transferred to S_new. 


These problems can be addressed with the help of 
the ComponentConfigurator which is used to reify 
the dependencies among strategies, instances of dy- 
namicTAO, and servants. 


3.1.2 DomainConfigurator and TAOConfig- 
urator 


Each process running the dynamicTAO ORB con- 
tains a ComponentConfigurator instance called Do- 
mainConfigurator. It is responsible for maintaining 
references to instances of the ORB and to servants 
running in that process. In addition, each instance 
of the ORB contains a customized subclass of Com- 
ponentConfigurator called TAOConfigurator. 


TAOConfigurator contains hooks to which dynam- 
icTAO strategies are attached. A NetworkBroker 
implements a simple TCP-based protocol that al- 
lows remote entities to connect to the process to 
inspect and change the configuration of dynam- 
icTAO by loading new strategies and attaching 
them to specific hooks. Local servants and remote 
CORBA clients can also access the Configurator ob- 
jects through a programmatic CORBA interface. 
Figure 3 illustrates this mechanism when a single 
instance of the ORB is present. 


If necessary, individual strategies may have their 
own customized subclass of ComponentConfigurator 
to manage their dependencies upon ORB instances 
and other strategies. These subclasses may also 
store references to client connections that depend 
on them. With this information, it is possible to 
decide when a strategy can be safely unloaded. 


Figure 3: Remote Configuration of dynamicTAO 
strategies. 


Consider, for example, the three concurrency strategies 
supported by dynamicTAO: Single-Threaded 
Reactive [Sch94], Thread-Per-Connection, and 
Thread-Pool. If the user switches from the Reactive 
or Thread-Per-Connection strategies to any other 
concurrency strategy, nothing special needs to be 
done. dynamicTAO may simply load the new strategy, 
update the proper TAOConfigurator hook, unload 
the old strategy, and continue. Old client connections 
will complete with the concurrency policy 
dictated by the old strategy. New connections will 
utilize the new policy. 


However, if one switches from the Thread-Pool 
strategy to another one, special care must be taken. 
The Thread-Pool strategy we developed maintains 
a pool of threads that is created when the strat- 
egy is initialized. The threads are shared by all in- 
coming connections to achieve a good level of con- 
currency without having the runtime overhead of 
creating new threads. A problem arises when one 
switches from this strategy to another strategy: the 
code of the strategy being replaced cannot be immediately 
unloaded. Because the threads are reused, they return 
to the Thread-Pool strategy code each time a connection 
finishes. This problem can be solved by a 
ThreadPoolConfigurator that keeps track of which threads 
are handling client connections and destroys them as the 
connections are closed. When the last thread is destroyed, 
the Thread-Pool strategy signals that it can be unloaded. 
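

The bookkeeping just described can be sketched in a few lines of C++. This is only an illustration under stated assumptions, not dynamicTAO code: the class name ThreadPoolConfigurator comes from the text, but every member name below (connectionOpened, connectionClosed, beginRetirement, canUnload) is hypothetical, and idle pool threads are ignored for brevity.

```cpp
#include <mutex>
#include <set>

// Hypothetical sketch: tracks which pool threads are still serving
// connections so that the strategy code is unloaded only when none remain.
class ThreadPoolConfigurator {
public:
    // Record that thread `id` started handling a client connection.
    void connectionOpened(int id) {
        std::lock_guard<std::mutex> lock(mutex_);
        busy_.insert(id);
    }

    // Record that thread `id` finished its connection. Returns true when
    // the thread should be destroyed instead of returning to the pool.
    bool connectionClosed(int id) {
        std::lock_guard<std::mutex> lock(mutex_);
        busy_.erase(id);
        return retiring_;
    }

    // Called once a replacement strategy has been installed.
    void beginRetirement() {
        std::lock_guard<std::mutex> lock(mutex_);
        retiring_ = true;
    }

    // True once the strategy is being replaced and no thread runs its code.
    bool canUnload() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return retiring_ && busy_.empty();
    }

private:
    mutable std::mutex mutex_;
    std::set<int> busy_;      // ids of threads currently serving a connection
    bool retiring_ = false;   // set when the strategy is being replaced
};
```

The point of the sketch is that unloading is gated on explicit dependency information (the set of busy threads) rather than on a guess about when the old code has gone quiet.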


Another problem occurs when one replaces the 
Thread-Pool strategy by a new one. There may be 
several incoming connections enqueued in the strategy 
waiting for a thread to execute them. The solution 
is to use the Memento pattern [GHJV95]: an object 
encapsulates the state of the old strategy, here the 
queue of waiting connections, and the system simply 
passes this object to the new strategy, which then 
takes care of the enqueued connections. 
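

A minimal C++ sketch of this state handover is given below. It assumes a much simpler interface than the real ORB: Connection, StrategyMemento, ConcurrencyStrategy, saveState and restoreState are hypothetical names introduced for illustration only.

```cpp
#include <cstddef>
#include <deque>
#include <memory>
#include <utility>

// Hypothetical stand-in for a pending client connection.
struct Connection { int id; };

// Memento: an opaque snapshot of the state that must survive the swap.
class StrategyMemento {
public:
    explicit StrategyMemento(std::deque<Connection> pending)
        : pending_(std::move(pending)) {}
    std::deque<Connection> takePending() { return std::move(pending_); }
private:
    std::deque<Connection> pending_;
};

class ConcurrencyStrategy {
public:
    void enqueue(Connection c) { queue_.push_back(c); }

    // Old strategy: hand over every connection that is still waiting.
    std::unique_ptr<StrategyMemento> saveState() {
        auto memento = std::make_unique<StrategyMemento>(std::move(queue_));
        queue_.clear();  // leave the moved-from queue in a known empty state
        return memento;
    }

    // New strategy: adopt the waiting connections from the memento.
    void restoreState(StrategyMemento& memento) {
        for (const Connection& c : memento.takePending()) queue_.push_back(c);
    }

    std::size_t pending() const { return queue_.size(); }

private:
    std::deque<Connection> queue_;
};
```

The old strategy never exposes its queue directly; the system only moves the memento from one strategy to the other, which keeps the reconfiguration code independent of each strategy's internal representation.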


Our group is currently expanding the set of dynam- 
icTAO strategies that can be replaced on-the-fly. 
The TAOConfigurator will have hooks for holding 
strategies for connection management, concurrency, 
(de)marshalling, request demultiplexing, method 
dispatching, scheduling, and security. An explicit 
knowledge of the dependencies among the ORB 
components is essential for implementing dynamic 
reconfiguration safely. 


3.2 Architectural Awareness in 2K 


In contrast to existing systems where a large number 
of non-utilized modules are carried along with the 
basic system installation, the 2K operating system 
is based upon a “what you need is what you get” 
(WYNIWYG) model. The system configures itself 
automatically and loads the minimum set of com- 
ponents required for executing user applications in 
the most efficient way. Components are downloaded 
from the network and only a small subset of system 
services are needed to bootstrap a node. 


This is achieved by reifying the hardware and soft- 
ware prerequisites for each loadable component. As 
mentioned in section 2.1, the operating system can 
use this information to make sure that all the ba- 
sic services that a component requires are available 
before the component is loaded. In addition, a dis- 
tributed resource manager uses the specifications of 
the component hardware requirements to decide in 
which machine the component should be loaded and 
perform admission control and resource reservation. 
That way, one will not face a situation in which 
a component fails to execute its task with the de- 
sired quality of service because an unspecified de- 
pendency was not resolved. 


As a component is loaded into the system, its pre- 
requisites are scanned and all the specified services 
are made available. During this process, the sys- 
tem can incrementally build a dynamic graph of de- 
pendencies using the ComponentConfigurator frame- 
work. 


The design of 2K supports fault-tolerant, self- 
adapting systems by monitoring the environment 
and maintaining a representation of the dynamic 
structure of its services and applications. The 
CORBA implementation of the ComponentConfigurator 
framework reifies the dynamic structure of the 
distributed system. 


When a 2K component fails, the system inspects its 
dependencies and informs the proper components 
about the failure. The system may alternatively re- 
cover from a failure by replacing the faulty compo- 
nent with a new one. The same mechanism can 
be used for adapting the system and its compo- 
nents to changing parameters such as network band- 
width, CPU load, resource availability, user access 
patterns, etc. 
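

The failure notification described above amounts to walking the reified dependency graph. The following sketch assumes a deliberately simplified configurator; the class Configurator and its members dependOn, fail and failuresSeen are hypothetical and do not reproduce the framework's actual interface.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for a component configurator.
class Configurator {
public:
    explicit Configurator(std::string name) : name_(std::move(name)) {}

    // Record that this component depends on `server`.
    void dependOn(Configurator& server) {
        hooks_.push_back(&server);
        server.clients_.push_back(this);
    }

    // Report a failure of this component to every component depending on it.
    void fail() {
        for (Configurator* client : clients_) client->onServerFailure(*this);
    }

    const std::string& name() const { return name_; }
    const std::vector<std::string>& failuresSeen() const { return failures_; }

private:
    void onServerFailure(const Configurator& server) {
        failures_.push_back(server.name());
    }

    std::string name_;
    std::vector<Configurator*> hooks_;    // components this one depends on
    std::vector<Configurator*> clients_;  // components that depend on this one
    std::vector<std::string> failures_;   // names of failed dependencies
};
```

A real implementation would react to the notification (for example by locating a replacement component) instead of merely recording it.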


4 Related Work 


The idea of using prerequisites to represent the dependencies 
among operating system objects was introduced 
in the SOS operating system [SGH+89] 
developed at INRIA, France. In the SOS model, 
objects contain a list of prerequisites that must be 
satisfied before they are activated. Even though 
the idea was promising, it was not fully explored 
in that project. Prerequisites were only used to ex- 
press that an object depends on the code imple- 
menting it. Not much experimentation was carried 
out [SGM89, Sha98]. SOS does not include a model 
for dynamic management of inter-component depen- 
dence. 


Previous research in microkernels and customizable 
operating systems, such as Mach [Lop91], SPIN 
[BSP+95], Exokernel [KEG+97], and µChoices 
[LTC96], developed low-level techniques for dynamically 
loading new modules into the operating system, 
both in kernel and in user space. Nevertheless, 
a high-level model for operating system reconfiguration 
does not yet exist. These previous works have 
not addressed a number of problems related to fault 
tolerance and dynamic reconfiguration. Using the 
ComponentConfigurator framework, our research 
investigates answers to the following questions. 


* What are the consequences of reconfiguring the 
operating system? 


* When a system module is replaced, which other 
modules are affected? 


* How must those other modules react? 


* When (re)configuring the system, which components 
must be loaded to meet the service demand 
and the required quality of service? 


* If a system component fails, how can the system 
detect it and recover gracefully? 


We are currently investigating languages for prereq- 
uisite specification. They must be able to repre- 
sent hardware and quality of service requirements 
as well as dependencies on other software compo- 
nents. Thus, we believe that an ideal language for 
prerequisite specification will build on previous work 
on Architecture Description Languages [Cle96] and 
QoS Specification Languages [FK98, LBS+98]. 


Connector-based systems like UniCon [SDZ96] and 
software buses like POLYLITH [Pur94] separate 
issues concerning component functional behavior 
from component interaction. Our model goes one 
step further by separating inter-component commu- 
nication from inter-component dependence. Con- 
nectors and software buses require that applica- 
tions be programmed to a particular communica- 
tion paradigm. Our framework is independent of 
the paradigm for inter-component communication; 
it can be used in conjunction with connectors, buses, 
local method invocation, CORBA, Java RMI, etc. 


Communication and dependence are often inti- 
mately related. But, in many cases, the dis- 
tinction between inter-component dependence and 
inter-component communication is beneficial. For 
example, the quality of service provided by a multi- 
media application is greatly influenced by the mech- 
anisms utilized by underlying services such as vir- 
tual memory, scheduling, and memory allocation 
(through the new operator). The interaction be- 
tween the application and these services is often 
implicit, i.e., no direct communication (e.g. library 
or system calls) takes place. Yet, if the system in- 
frastructure allows developers to establish and ma- 
nipulate dependence relationships between the ap- 
plication and these services, the application can be 
informed of substantial changes in the state and con- 
figuration of the services that may affect its perfor- 
mance. 


Unlike previous work in this area, our 
model does not dictate a particular communication 
paradigm like connectors or buses. As shown in section 
3.1, the model was applied to a legacy system 
without requiring any modification to its functional 
implementation or to its inter-component communication 
mechanisms. 


We are particularly interested in investigating the 
possibilities of applying results from previous and 
ongoing work in dynamic reconfiguration [HWP93, 
SW98, BBB+98] to standard architectures such as 
CORBA and Java Beans. 


5 Ongoing and Future Work 


The current implementation of the framework in 
C++ is being used in dynamicTAO as its dynamic 
reconfigurability is enhanced. In addition, the Java 
implementation is being used by researchers at the 
University of São Paulo to prototype a domain 
decomposition manager. This manager has two 
demonstration applications: a Distributed Informa- 
tion System for Mobile Agents [SGE98] and the 
parallelization of an Atmospheric Modeling System 
[Bar98]. 


Work on implementations of the framework in Java 
RMI is underway. As discussed in section 3.2, the CORBA 
implementation of the ComponentConfigurator will 
be used in the 2K operating system to support run- 
time architectural awareness as the basis for imple- 
menting fault-tolerant reconfigurable systems. The 
prerequisites model will be used for QoS-aware re- 
source management. This will provide components 
with all the hardware and software resources they 
need to execute with the desired quality of service. 


6 Conclusions 


We have presented a model for runtime archi- 
tectural awareness in centralized and distributed 
component-based systems. We believe that the reifi- 
cation of inter-component dependence and compo- 
nent prerequisites is fundamental for systems sup- 
porting fault-tolerant, reconfigurable components. 


The model has been prototyped in Java, C++, and 
CORBA. The C++ framework was successfully de- 
ployed in dynamicTAO, a legacy system, which was 
made aware of its own internal structure. 


Future work in the 2K operating system will demon- 
strate how the model behaves in a complex, dis- 
tributed CORBA-based system. 


Availability 


The framework source code in C++ and Java 
is available at 
http://choices.cs.uiuc.edu/2k/ComponentConfigurator . 


The source code and detailed documentation for 
dynamicTAO can be found at 
http://choices.cs.uiuc.edu/2k/dynamicTAO . 


Abstract¹ 


Architectural evolution is a costly yet unavoidable 
consequence of a successful application. One method 
for reducing cost is to automate aspects of the 
evolutionary cycle when possible. Three kinds of 
architectural evolution in object-oriented systems are: 
schema transformations, the introduction of design 
pattern microarchitectures, and the hot-spot-driven- 
approach. This paper shows that all three can be viewed 
as transformations applied to an evolving design. 
Further, the transformations are automatable with 
refactorings —  behavior-preserving program 
transformations. A comprehensive list of refactorings 
used to evolve large applications is provided and an 
analysis of supported schema transformations, design 
patterns, and hot-spot meta patterns is presented. 
Refactorings enable the evolution of architectures on an 
if-needed basis reducing unnecessary complexity and 
inefficiency. 


1 Introduction 


All successful software applications evolve [Par79]. 
During the 1970s, evolution and maintenance 
accounted for 35 to 40 percent of the software budget 
for an information systems organization. This number 
jumped to 60 percent in the 1980s. It was predicted that 
without a major change in approach, many companies 
will spend close to 80 percent of their software budget 
on maintenance [Pre92]. As applications evolve, so do 
their architectures. Architectures evolve for multiple 
reasons: 


* Capability: to support new features or changes 
to existing features. 


* Reusability: to carve out software artifacts for 
reuse in other applications. 


* Extensibility: to provide for the addition of 
future extensions. 


* Maintainability: to reduce the cost of software 
maintenance through restructuring. 


1. We gratefully acknowledge the sponsorship of 
Microsoft Research, the Defense Advanced 
Research Projects Agency (Cooperative Agreement 
F30602-96-2-0226), and the University of Texas at 
Austin Applied Research Laboratories. 


We have observed that architectures also evolve for 
human reasons: 


* Experience. Experienced employees can often 
design a better architecture based on their 
knowledge of the current architecture. 


* New Perspective. New project members often have 
new ideas about how an architecture could or 
should be structured. Many organizations use a 
code ownership model which empowers new 
employees with the ability to realize their new 
designs. 


While motivations vary, the methods used for evolving 
architectures appear to follow regular patterns, 
particularly for object-oriented applications. Three 
kinds of object-oriented architectural evolution are: 
schema transformations, the introduction of design 
pattern microarchitectures, and the hot-spot-driven- 
approach. Schema transformations are drawn from 
object-oriented database schema transformations that 
perform edits on a class diagram [Ban87]. Examples are 
renaming a class, adding new instance variables, and 
moving a method up the class hierarchy. Design 
patterns are recurring sets of relationships between 
classes, objects, methods, etc. that define preferred 
solutions to common object-oriented design problems 
[Gam95]. The hot-spot-driven-approach is based on the 
identification of aspects of a software program which 
are likely to change from application to application (i.e. 
hot-spots) [Pre95]. Architectures using abstract classes 
and template methods are prescribed to keep these 
hot-spots flexible. 


Refactorings are behavior-preserving program 
transformations which directly aid in the 
implementation of new architectures. Primitive 
refactorings perform simple edits such as adding new 
classes, creating instance variables, and moving 
instance variables up the class hierarchy. Compositions 
of refactorings can create abstract classes, capture 
aggregation and components [Opd92], extract template 
and hook methods, and even install design pattern 
microarchitectures [Tok95]. Although composing 
refactorings to achieve a desired result may require 
some planning, this effort is negligible compared to the 
manual task of identifying all lines of source code 
affected by a change, performing hand-edits, retesting 
all fixes, and risking the introduction of new errors. 


We are pursuing two approaches to promote refactoring 
research. The first is to evaluate refactorings of large 
applications. We believe that we are the first to provide 
empirical evidence on the usefulness of refactorings 
when applied to non-trivial applications [Tok99]. In one 
example, a major architectural change, the splitting 
of a class hierarchy, is automated². In a second 
example on a large application (~500K LOC), 
approximately fourteen thousand lines of code changes 
between two code releases are automated with 
refactorings. (See paper for details.) 


A second approach to promoting refactoring research is 
to demonstrate that common forms of architectural 
evolution can be automated. This paper catalogs the 
schema transformations, design pattern restructurings, 
and hot-spot meta patterns which can be automated 
with refactorings. By demonstrating broad coverage of 
common modes of evolution, it is argued that 
refactorings will be generally useful in the evolutionary 
maintenance cycle. 


A summary of the class diagram notation used 
throughout the remainder of this paper is presented in 
Figure 1. Within the main body of text, we use the 
following conventions: 


* Refactoring: a refactoring. 


* AbstractClass: an abstract class name. 


* ConcreteClass: a concrete class name. 


* Method(): a method or procedure name. 


* Instance_variable: an instance variable. 


Figure 1: Notation 


2. The term automated in this paper refers to a refactoring’s 
programmed check for enabling conditions and its execution 
of all source code changes. The choice of which 
refactorings to apply is always made by a human. 


2 Refactorings 


A refactoring is a parameterized behavior-preserving 
program transformation that automatically updates an 
application’s design and underlying source code. A 
refactoring is typically a very simple transformation, 
one that has a straightforward (but not necessarily 
trivial) impact on application source code. An example 
is inherit[Base, Derived], which establishes a 
superclass-subclass relationship between two classes, 
Base and Derived, that were previously unrelated. From 
the perspective of an object-oriented class diagram, 
inherit merely adds an inheritance relationship between 
the Base and Derived classes, but it also checks enabling 
conditions to determine if the change can be made 
safely and it alters the application’s source code to 
reflect this change. A refactoring is more precisely 
defined by (a) a purpose, (b) arguments, (c) a 
description, (d) enabling conditions, (e) an initial state, 
and (f) a target state. Such a definition for inherit[Base, 
Derived] is given in Figure 2. Applying refactorings is 
superior to hand-coding similar changes because it 
allows a designer to evolve the architecture of an 
existing body of code at the level of a class diagram 
leaving the code-level details to automation. 
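

To illustrate what checking enabling conditions before a transformation can look like, the sketch below models a class diagram as a map from class name to superclass name and applies inherit only when two of the conditions listed in Figure 2 hold (Derived has no superclass, and Base is not a subclass of Derived). The ClassDiagram type and both function names are assumptions made for this example; the remaining conditions (virtual methods, initializer lists, object size) need source-level analysis and are not modeled.

```cpp
#include <map>
#include <string>

// Hypothetical class-diagram model: class name -> superclass name ("" = none).
using ClassDiagram = std::map<std::string, std::string>;

// True if `cls` is `ancestor` or inherits from it, directly or transitively.
bool isSubclassOf(const ClassDiagram& d, std::string cls,
                  const std::string& ancestor) {
    while (!cls.empty()) {
        if (cls == ancestor) return true;
        auto it = d.find(cls);
        cls = (it == d.end()) ? std::string() : it->second;
    }
    return false;
}

// Applies inherit[Base, Derived] if the modeled enabling conditions hold.
// Returns false and leaves the diagram untouched otherwise.
bool inherit(ClassDiagram& d, const std::string& base,
             const std::string& derived) {
    auto b = d.find(base);
    auto dv = d.find(derived);
    if (b == d.end() || dv == d.end()) return false;   // both classes must exist
    if (!dv->second.empty()) return false;             // Derived must not have a superclass
    if (isSubclassOf(d, base, derived)) return false;  // Base must not be a subclass of Derived
    dv->second = base;                                 // record the new relationship
    return true;
}
```

Because the check and the edit live in one operation, a caller cannot perform the edit without the check, which is the property that distinguishes a refactoring from a hand edit.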


Name: 
Inherit[Base, Derived] 


Purpose: 
To establish a public superclass-subclass relationship 
between two existing classes. 


Arguments: 
Base - superclass name 
Derived - subclass name 


Description: 
Inherit[] makes Base a superclass of Derived. vm*() 
represents the unimplemented virtual methods 
inherited by Base subclasses. 


Enabling Conditions: 
* Base must not be a subclass of Derived and Derived 
must not have a superclass. 
* Subclasses of Base must support methods vm*() if 
objects of that class are created. Otherwise, there will 
be no implementations for vm*(). 
* Initializer lists must not be used to initialize Derived 
objects. Initializer lists must initialize aggregates and 
aggregates cannot have superclasses [Ell90]. 
* Program behavior must not depend on the size of 
Derived. Adding a superclass can affect its size. 


[Class diagrams: (a) Initial State, (b) Target State] 


Figure 2: Inherit[Base, Derived] transformation 


Banerjee and Kim proposed a set of schema evolutions 
for evolving object-oriented database schemas [Ban87] 
and Opdyke proposed a list of primitive refactorings for 
object-oriented languages [Opd92]. Roberts implements 
many of these refactorings for the Smalltalk language 
[Rob97]. In addition to previous refactorings, we have 
found that transforming actual applications requires a 
larger set. We enlarged the set of schema evolutions to 
include, for example, substitute. Substitute changes a 
class’ dependency on some class C to a dependency on 
a superclass of C [Tok95]. A second new set of 
refactorings is language-specific. 
Procedure_to_method and structure_to_class are 
used to convert C artifacts to their C++ equivalents. A 
third set supports the addition of design pattern 
microarchitectures in evolving programs [Tok95]. An 
example is add_factory_method which creates a 
method returning a new object and replaces all C++ 
invocations of "new Object" with a call to the 
method. This refactoring is used to add the Factory 
Method design pattern [Gam95]. 
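

As a rough picture of the result of add_factory_method, the sketch below shows hand-written C++ in the shape such a refactoring might produce. It is not output of the authors' tool: the classes Dialog, Button and FancyButton and the method name createButton are hypothetical.

```cpp
#include <memory>
#include <string>

class Button {
public:
    virtual ~Button() = default;
    virtual std::string label() const { return "Button"; }
};

class FancyButton : public Button {
public:
    std::string label() const override { return "FancyButton"; }
};

class Dialog {
public:
    virtual ~Dialog() = default;

    // Before the refactoring this method contained: new Button
    std::string render() {
        std::unique_ptr<Button> button(createButton());
        return button->label();
    }

protected:
    // Factory method introduced by the refactoring. Its initial body
    // returns exactly what the replaced expression created, so the
    // behavior of existing code is preserved.
    virtual Button* createButton() { return new Button; }
};

// A subclass can now change the product without touching render().
class FancyDialog : public Dialog {
protected:
    Button* createButton() override { return new FancyButton; }
};
```

The refactoring itself only introduces createButton and rewrites the call site; adding FancyDialog is a later, behavior-changing step that the new structure makes possible.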


A list of refactorings used in our research is presented 
in Table 1. Refactorings proposed in previous work are 
noted. Refactorings first implemented by our research 
for object-oriented software evolution appear in italics. 


Schema Refactorings 
add_variable 
create_variable_accessor 
create_method_accessor 
rename_variable 
remove_variable 
push_down_variable 
pull_up_variable 
move_variable_across_object_boundary 
create_class 
rename_class 
remove_class 
inherit 
uninherit 
substitute 
rename_method 
remove_method 
push_down_method 
pull_up_method 
move_method_across_object_boundary 
extract_code_as_method 
declare_abstract_method 
structure_to_pointer 


C++ Refactorings 
procedure_to_method 
structure_to_class 


Design Pattern Refactorings 
add_factory_method 
create_iterator 
composite 
decorator 
procedure_to_command 
singleton 


Sources: 1. [Ban87]  2. [Opd92]  3. [Tok95]  4. [Rob97] 


Table 1: Object-oriented refactorings 


Figure 3.1: Using pull_up_variable to move 
instance variables "iv" from Derived to Base 


3 Automatable Modes of Evolution 


3.1 Schema Transformations 


The database schema for an object-oriented database 
management system (OODBMS) looks like a class 
diagram for an object-oriented application. Similarly, 
OODBMS schema transformations have parallels in 
object-oriented software evolution. An example schema 
transformation is moving the domain of an instance 
variable up the inheritance hierarchy (Figure 3.1). This 
transformation is supported by the refactoring 
pull_up_variable which moves an instance variable to 
a superclass. 
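

In C++ terms, the effect of pull_up_variable can be pictured as follows. The class and variable names follow Figure 3.1; the accessor methods set and get are hypothetical additions that give the example something to exercise.

```cpp
// Before pull_up_variable:
//   class Base    { };
//   class Derived : public Base { protected: int iv = 0; /* ... */ };

// After pull_up_variable: iv is declared once, in Base.
class Base {
protected:
    int iv = 0;
};

class Derived : public Base {
public:
    // Existing uses of iv inside Derived compile unchanged, because the
    // variable is still visible through inheritance.
    void set(int value) { iv = value; }
    int get() const { return iv; }
};
```

Other subclasses of Base gain the variable as a side effect, which is one reason a tool has to check enabling conditions before making the move.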


Banerjee and Kim describe 19 object-oriented database 
schema transformations of which we implement 12 as 
automated refactorings: 


Description from Banerjee and Kim [Ban87] | Refactoring from Table 1 
Adding a new instance variable | add_variable 
Drop an existing instance variable | remove_variable 
Change the name of an instance variable | rename_variable 
Change the domain of an instance variable | pull_up_variable and push_down_variable 
Drop the composite link property of an instance variable (a) | structure_to_pointer 
Drop an existing method | remove_method 
Change the name of a method | rename_method 
Make a class S a superclass of class C | inherit 
Remove class S as a superclass of class C | uninherit 
Add a new class | create_class 
Drop an existing class | remove_class 
Change the name of a class | rename_class 


a. A class A with an instance variable of class B having 
the composite link property specifies that A owns B. 
B cannot be created independently of A and B cannot 
be accessed through a composite link of another 
object. 


The seven refactorings which are not supported are: 
changing the value of a class variable, changing the 
code of a method, changing the default value of an 
instance variable, changing the inheritance parent of 
an instance variable, changing the inheritance of a 
method, adding a method, and changing the order of 
superclasses. The first three refactorings are not 
behavior-preserving. The next two are not supported 
by mainstream object-oriented programming languages. 
The sixth (adding a method) cannot be automated. 
The seventh (changing the order of 
superclasses) is not supported because this research 
is currently limited to applications without multiple 
inheritance. 


Three other useful schema transformations not listed in 
[Ban87] are: 


Description | Refactoring 
Move a variable through a composite link | move_variable_across_object_boundary (Figure 3.2) 
Move a method through a composite link | move_method_across_object_boundary 
Change a class’ dependency on a class C to a dependency on a superclass S of C | substitute (Figure 3.3) 


Schema transformations perform many of the simple 
edits encountered when evolving class diagrams. They 
can be used alone or in combination to evolve object-oriented 
architectures. 


Figure 3.2: Using move_variable_across_object_boundary to 
move instance variables x and y 


Figure 3.3: Using substitute to change Filer’s 
reference to a Letter to a reference to a 
Document 


3.2 Design Pattern Microarchitectures 


Design patterns capture expert solutions to many 
common object-oriented design problems: creation of 
compatible components, adapting a class to a different 
interface, subclassing versus subtyping, isolating third 
party interfaces, etc. Patterns have been discovered in a 
wide variety of applications and toolkits including 
Smalltalk Collections [Gol84], ET++ [Wei88], MacApp 
[App89], and InterViews [Lin92]. As with database 
schema transformations, refactorings have been shown 
to directly implement certain design patterns: 


Command [Tok99]: Command encapsulates a request as
an object, thereby letting you parameterize clients with
different requests, queue or log requests, and support
undoable operations. The procedure_to_command
refactoring converts a procedure to a command class.


Factory Method [Tok95]: Factory Method defines an
interface for creating an object, but lets subclasses
decide which class to instantiate. The
add_factory_method refactoring adds a factory method
to a class.


Singleton [Tok99]: Singleton ensures a class will have
only one instance and provides a global point of access
to it. The singleton refactoring converts an empty class
into a singleton.





We directly support three additional patterns as 
refactorings: 


Composite: Composite composes objects into tree
structures to represent part-whole hierarchies. The
composite refactoring converts a class into a composite
class.


Decorator: Decorator attaches additional
responsibilities to an object dynamically. The decorator
refactoring converts a class into a decorator class.


Iterator: Iterator provides a way to access the elements
of an aggregate object sequentially without exposing its
underlying representation. The create_iterator
refactoring generates an iterator class.


While design patterns are useful when included in an 
initial software design, they are often applied in the 
maintenance phase of the software lifecycle [Gam93]. 
For example, the original designer may have been 
unaware of a pattern or additional system requirements 
may arise that require unanticipated flexibility. 
Alternatively, patterns may lead to extra levels of 
indirection and complexity inappropriate for the first 
software release. A number of patterns can be viewed 
as automatable program transformations applied to an 
evolving design. Examples for the following two 
patterns have been documented: 


Abstract Factory [Tok95]: Abstract Factory provides an
interface for creating families of related or dependent
objects without specifying their concrete class.





5th USENIX Conference on Object-Oriented Technologies and Systems (COOTS '99)












Visitor: Visitor lets you define a new operation without
changing the classes of the elements on which it
operates.







At least five additional patterns from [Gam95] can be
viewed as program transformations:


Adapter: Adapter lets classes work together that
couldn’t otherwise because of incompatible interfaces.


Bridge: Bridge decouples an abstraction from its
implementation so that the two can vary independently.


Builder: Builder separates the construction of a
complex object from its representation so that the same
construction process can create different
representations.


Strategy: Strategy lets algorithms vary independently
from the clients that use them.


Template Method: Template Method lets subclasses
redefine certain steps of an algorithm without changing
the algorithm structure.








In all cases, we can apply refactorings to simple designs 
to create the designs used as prototypical examples in 
[Gam95]. The following sections show how the first 
two patterns can be automated. 


3.2.1 Adapter 


Adapter lets classes work together that couldn’t 
otherwise because of incompatible interfaces. In the 
object adapter example from [Gam95] (Figure 3.4), the 
TextShape class adapts TextView’s GetExtent()
method to implement BoundingBox(). The adapter 
can be constructed from the original TextView class 
(Figure 3.5) in five steps: 


1. Create the classes TextShape and Shape using 
create_class. 


2. Make TextShape a subclass of Shape using inherit 
(Figure 3.6). 


3. Add the text instance variable to TextShape using 
add_variable (Figure 3.7). 


Figure 3.4: TextShape adapts TextView’s
interface


Figure 3.5: Unadapted TextView class


Figure 3.6: Adapter class created


Figure 3.7: Adaptee instance variable added
to adapter


4. Create the BoundingBox() method which calls
text->GetExtent() using create_method_accessor.
Create_method_accessor creates a method which
replaces calls of the form
instance_variable->method().


5. Declare BoundingBox() in Shape using
declare_virtual_method (Figure 3.4).
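

The design that results from these five steps can be pictured in C++. The following is only a minimal sketch of the end state, not output of the refactoring tool: the Rect type, the constructor arguments, and the demo_adapter_width() helper are assumptions introduced to make the example self-contained.

```cpp
#include <cassert>

// Hypothetical geometry type, used only for this sketch.
struct Rect { int x, y, width, height; };

// Adaptee: the original, unadapted class (Figure 3.5).
class TextView {
public:
    TextView(int w, int h) : width_(w), height_(h) {}
    Rect GetExtent() const { return Rect{0, 0, width_, height_}; }
private:
    int width_, height_;
};

// Step 1 creates Shape; step 5 declares BoundingBox() in it.
class Shape {
public:
    virtual ~Shape() = default;
    virtual Rect BoundingBox() const = 0;
};

// Steps 1-4: TextShape is created, made a subclass of Shape,
// given the `text` instance variable, and given an accessor
// method that forwards to text->GetExtent().
class TextShape : public Shape {
public:
    explicit TextShape(const TextView* t) : text(t) {}
    Rect BoundingBox() const override { return text->GetExtent(); }
private:
    const TextView* text;  // adaptee reference added in step 3
};

// Usage: a client that sees only the Shape interface.
inline int demo_adapter_width() {
    TextView view(80, 24);
    TextShape shape(&view);
    const Shape& s = shape;
    Rect box = s.BoundingBox();
    assert(box.height == 24);
    return box.width;
}
```

The client code in demo_adapter_width() depends only on Shape, which is the property the adapter is meant to provide.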


Figure 3.8: Bridge design pattern example


Figure 3.9: Design for a single window system


3.2.2 Bridge 


Bridge decouples an abstraction from its
implementation so that the two can vary independently.
In the example from [Gam95] (Figure 3.8), the Window 
abstraction and WindowImp implementation are placed 
in separate hierarchies. All operations on Window 
subclasses are implemented in terms of abstract 
operations from the WindowImp interface. Only the 
WindowImp hierarchy needs to be extended to support 
another windowing system. We refer to the relationship 
between Window and WindowImp as a bridge because it 
bridges the abstraction and its implementation, allowing 
them to vary independently. 


Figure 3.10: Implementor classes created


Figure 3.11: Implementor instance variable
added to Window


Refactorings can be used to install a bridge design
pattern given a simple design committed to a single
window system. Figure 3.9 depicts a system designed
for X-Windows. This system can be evolved with
refactorings to use the bridge design pattern in seven
steps:


1. Create classes XWindow and WindowImp using 
create_class. 


2. Make WindowImp a superclass of XWindow with 
inherit (Figure 3.10). 


3. Add instance variable imp to the Window class 
using add_variable (Figure 3.11). 


4. Move methods DrawLine() and DrawText() to
the XWindow class using the refactoring
move_method_across_object_boundary. These
methods are accessed through the imp instance
variable (Figure 3.12).


5. Declare methods DrawLine() and DrawText() in
WindowImp with declare_abstract_method.


6. Change the type of instance variable imp from
XWindow to WindowImp using substitute (Figure
3.13).


7. Add a DrawText() method to Window which calls
DrawText() in WindowImp using
create_method_accessor (Figure 3.8).


Figure 3.12: Window system specific methods
moved to XWindow class


Figure 3.13: Virtual methods declared so that
"imp" can be generalized to
class WindowImp


The Bridge architecture uses object composition to 
provide needed flexibility. Object composition is also 
present in the Builder and Strategy design patterns. The 
trade-offs between use of inheritance and object 
composition are discussed in [Gam95, pp. 18-20]. 
Refactorings allow a designer to safely migrate from 
statically checkable designs using inheritance to 
dynamically defined designs using object-composition. 
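

As an illustration, the end state of the seven Bridge steps might look roughly like the C++ below. This is a sketch under assumptions rather than the tool's generated code: the DrawRect() method (suggested by the four imp->DrawLine() calls in the figures), the calls vector used for recording, and the demo_bridge_calls() helper are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Step 5: abstract operations declared in WindowImp.
class WindowImp {
public:
    virtual ~WindowImp() = default;
    virtual void DrawLine() = 0;
    virtual void DrawText() = 0;
};

// Steps 1, 2 and 4: XWindow is a subclass of WindowImp and holds
// the window-system-specific methods moved out of Window.
class XWindow : public WindowImp {
public:
    void DrawLine() override { calls.push_back("XDrawLine"); }
    void DrawText() override { calls.push_back("XDrawText"); }
    std::vector<std::string> calls;  // hypothetical record of drawing calls
};

// Steps 3, 6 and 7: Window owns `imp`, typed as WindowImp, and
// reaches the implementation only through it.
class Window {
public:
    explicit Window(WindowImp* i) : imp(i) {}
    void DrawRect() {                 // hypothetical client operation
        for (int side = 0; side < 4; ++side) imp->DrawLine();
    }
    void DrawText() { imp->DrawText(); }  // accessor created in step 7
private:
    WindowImp* imp;
};

// Usage: Window never names XWindow, so another WindowImp
// subclass could be substituted without changing Window.
inline std::size_t demo_bridge_calls() {
    XWindow x;
    Window w(&x);
    w.DrawRect();
    w.DrawText();
    assert(x.calls.back() == "XDrawText");
    return x.calls.size();
}
```

Supporting a second windowing system would then mean adding one more WindowImp subclass, which matches the claim that only the WindowImp hierarchy needs to be extended.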


Figure 3.14: Overenthusiastic use of design
patterns


3.2.3 Role of Refactorings for Design Patterns


Gamma et al. note that a common design pattern pitfall
is overenthusiasm: "Patterns have costs (indirection,
complexity) therefore [one should] design to be as
flexible as needed, not as flexible as possible." The
example from [Gam96] is displayed in Figure 3.14.
Instead of creating a simple Circle class, an
overenthusiastic designer adds a Circle factory with
strategies for each method, a bridge to a Circle
implementation, and a Circle decorator. The design is
likely to be more complex and inefficient than what is
actually required. The migration from a single Circle
class to the complex microarchitecture in Figure 3.14
can be viewed as a transformation. This transformation
is in fact automatable with refactorings (see footnote 4).
Thus, instead of overdesigning, one can start with a
simple Circle class and add the Factory Method,
Strategy, Bridge, and Decorator design patterns as
needed.


Refactorings can restructure existing implementations
to make them more flexible, dynamic, and reusable;
however, their ability to affect algorithms is limited.
Patterns such as Chain of Responsibility and Memento
require that algorithms be designed with knowledge
about the patterns employed. These patterns are thus
considered fundamental to a software architecture
because there is no refactoring-enabled evolutionary
path which leads to their use. Refactorings allow a
designer to focus on fundamental patterns when
creating a new software architecture. Patterns supported
through refactorings can be added on an if-needed basis
to the current or future architecture at minimal cost.


Footnote 4. A Circle factory is created [Tok95].
Strategies are added (Section 3.2). The Bridge pattern is
applied (Section 3.2.2). Finally, a decorator is added
(Section 3.2).


Figure 3.15: Initial state of mailing system


3.3 Hot-Spot Analysis 


The hot-spot-driven approach [Pre94] identifies which
aspects of a framework are likely to differ from 
application to application. These aspects are called hot- 
spots. When a data hot-spot is identified, abstract 
classes are introduced. When a functional hot-spot is 
identified, extra methods and classes are introduced. 


3.3.1 Data Hot-Spots 


When the instance variables are likely to differ
between applications, Pree prescribes the creation of
abstract classes. Refactorings have repeatedly
demonstrated the ability to create abstract classes
[Opd93, Tok95, Rob97]. As an example, Pree and Sikora
provide a Mailing System case study [Pre95]. Figure
3.15 displays the initial state of its software
architecture. In this system, a Folder cannot be nested,
and only a TextDocument can be mailed. Their suggested
architecture is displayed in Figure 3.16. Under the
improved architecture, Folders can be nested and any
subclass of DesktopItem can be mailed. Refactorings can
automate these changes in five steps:


1. Create a DesktopItem class using create_class 
(Figure 3.17). 


2. Make DesktopItem a superclass of TextDocument 
using inherit (Figure 3.18). 


3. Generalize the link between Mailer and 
TextDocument to a link between Mailer and 
DesktopItem using substitute (Figure 3.19). 
Subclasses of DesktopItem can now be mailed. 


4. Generalize the link between Folder and
TextDocument to a link between Folder and
DesktopItem using substitute (Figure 3.20). Folder
can now contain any DesktopItem.


5. Make Folder a subclass of DesktopItem using
inherit (Figure 3.16). A Folder which can contain a
DesktopItem can now contain another Folder.


Figure 3.16: Final state of mailing system


Figure 3.17: Empty DesktopItem class
created


Figure 3.18: TextDocument inherits from
DesktopItem


Figure 3.19: Mailer dependency changed from
TextDocument to DesktopItem


Figure 3.20: Folder can contain any DesktopItem


With the improved architecture, a Folder can be nested 
within another Folder and DesktopItem provides a 
superclass for adding other types of media to be mailed. 
These changes which would normally be implemented 
and tested by hand can be automated with refactorings. 
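

The improved mailing architecture can be sketched in C++ as follows. The class names come from the case study, but everything else is an assumption made for illustration: the CountItems() and Add() methods, the Mail() signature, and the demo_mail_nested_folder() helper are hypothetical and do not come from [Pre95].

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Step 1: the new abstract superclass.
class DesktopItem {
public:
    virtual ~DesktopItem() = default;
    // Hypothetical operation so the sketch has observable behavior.
    virtual std::size_t CountItems() const { return 1; }
};

// Step 2: TextDocument inherits from DesktopItem.
class TextDocument : public DesktopItem {};

// Steps 4 and 5: Folder holds DesktopItems and is itself a
// DesktopItem, so folders can be nested.
class Folder : public DesktopItem {
public:
    void Add(std::shared_ptr<DesktopItem> item) {
        items.push_back(std::move(item));
    }
    std::size_t CountItems() const override {
        std::size_t n = 1;  // the folder itself
        for (const auto& item : items) n += item->CountItems();
        return n;
    }
private:
    std::vector<std::shared_ptr<DesktopItem>> items;
};

// Step 3: Mailer depends on DesktopItem instead of TextDocument.
class Mailer {
public:
    // Returns how many items were mailed (hypothetical behavior).
    std::size_t Mail(const DesktopItem& item) const {
        return item.CountItems();
    }
};

// Usage: mail a folder that contains a document and a nested folder.
inline std::size_t demo_mail_nested_folder() {
    auto inner = std::make_shared<Folder>();
    inner->Add(std::make_shared<TextDocument>());
    Folder outer;
    outer.Add(inner);
    outer.Add(std::make_shared<TextDocument>());
    Mailer mailer;
    std::size_t mailed = mailer.Mail(outer);
    assert(mailed == 4);  // outer, inner, and two documents
    return mailed;
}
```

Neither Mailer nor Folder mentions TextDocument any longer, which is what allows new media types to be added as further DesktopItem subclasses.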


3.3.2 Functional Hot-Spots 


For the case of differing functionality, solutions based 
on template and hook methods are prescribed to provide 
the needed behavior. A template method provides the 
skeleton for a behavior. A hook method is called by the 
template method and can be tailored to provide 
different behaviors. Figure 3.21 is an example of a 
template method and hook method defined in the same 
class. Different subclasses of T can override hook
method M2(), which leads to differing functionality in
template method M1() (Figure 3.22). Pree identifies
seven meta patterns for template and hook methods:
unification, 1:1 connection, 1:N connection, 1:1
recursive connection, 1:N recursive connection, 1:1
recursive unification, and 1:N recursive unification
[Pre94]. Refactorings automate the introduction of meta
patterns into evolving architectures. The transitions
between patterns enabled by refactorings are displayed
in Figure 3.23 (see footnote 5). As examples, we
demonstrate support for the first two transitions.


Figure 3.21: Template and hook methods in
same class


Figure 3.22: Hook method M2() overridden
in class H


Footnote 5. We consider the 1:N connection composition
to be fundamental to an architecture. For this pattern, a
template object is linked to a collection of hook
objects. This implies that the template method has
knowledge about how to use multiple hook methods
and thus cannot be derived from the 1:1 connection
composition in which the template method is coded
for a single hook method.


Figure 3.23: Hot-spot meta pattern transitions
enabled by refactorings


Figure 3.24: Method M1() calls a special behavior
which differs for each application


Figure 3.25: Hook class created


In the unification composition, both the template and
hook methods are located in the same class (Figure
3.21). The behavior of the template is changed by
overriding the hook method in a subclass (Figure 3.22).
An architecture with no template or hook methods can
be transformed to use the unification meta pattern
(transition 1 from Figure 3.23). Consider the class
diagram in Figure 3.24 with class T having method M1()
which calls some special behavior. A hook method can
be added with refactorings in one step:


1. Create a hook method M2() which executes the 
special behavior using extract_code_as_method 
(Figure 3.21). Extract_code_as_method replaces a 
block of code with a call to a newly created method 
which executes the block. 


In the new microarchitecture, general behavior is
contained in template method M1() while special
behavior is captured by hook method M2(). To extend
the architecture, subclasses of T override M2() to
provide alternative behaviors for M1(). The extended
structure can be added in four steps:


1. Create class H using create_class. 


2. Make T a superclass of H using inherit (Figure
3.25).


3. Make M2() overridable by the subclasses of T using
declare_abstract_method.


4. Move the implementation of M2() into H using
push_down_method (Figure 3.22).


Figure 3.27: Connection to H object
created
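

A minimal C++ sketch of the unification meta pattern after these steps is given below. The classes T and H and the methods M1() and M2() follow the figures; the string return values, the fixed loop count, and the demo_unification() helper are assumptions chosen only so that the behavior can be observed.

```cpp
#include <cassert>
#include <string>

class T {
public:
    virtual ~T() = default;
    // Template method: the general behavior, which calls the hook.
    std::string M1() const {
        std::string out;
        for (int i = 0; i < 2; ++i) {  // stands in for "while (...)"
            out += "[" + M2() + "]";
        }
        return out;
    }
protected:
    // Hook method, made overridable by declare_abstract_method.
    virtual std::string M2() const = 0;
};

// Steps 1, 2 and 4: H is created as a subclass of T and receives
// the implementation of M2() via push_down_method.
class H : public T {
protected:
    std::string M2() const override { return "h"; }  // special behavior
};

// Usage: the template method is invoked through T, while the
// special behavior comes from the subclass.
inline std::string demo_unification() {
    H h;
    const T& t = h;
    std::string out = t.M1();
    assert(out == "[h][h]");
    return out;
}
```

Another subclass of T could supply a different M2() without any change to M1(), which is the variation point the hook is meant to expose.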


As a second example, we support the transition from 
unification to 1:1 connection (transition 2 from Figure 
3.23). Consider the 1:1 connection meta pattern which 
stores the hook method in an object owned by the 
template class (Figure 3.26). Behavior can be changed 
at run-time by assigning a hook object with a different 
behavior to the template class. 1:1 connection can be 
automated in three steps using the unification pattern 
(Figure 3.21) as a starting point. 


1. Create class H using create_class. 


2. Add an instance variable ref to T with
add_variable (Figure 3.27).


3. Move M2() to class H using
move_method_across_object_boundary (Figure
3.26).


The behavior of template method M1() can now be
altered dynamically by pointing to different hook class
objects with different implementations of M2(). Other
transitions in Figure 3.23 are similarly supported.
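

The 1:1 connection meta pattern might be sketched in C++ as below. T, H, ref, M1() and M2() follow the text; the AltH subclass, the SetHook() method, the string return values, and the demo_connection() helper are hypothetical additions used to show the run-time exchange of hook objects.

```cpp
#include <cassert>
#include <string>

// Steps 1 and 3: H is created and receives M2().
class H {
public:
    virtual ~H() = default;
    virtual std::string M2() const { return "h"; }
};

// Hypothetical second hook class with a different M2().
class AltH : public H {
public:
    std::string M2() const override { return "a"; }
};

// Step 2: T gains the instance variable `ref` and calls the hook
// through it, as in ref->M2().
class T {
public:
    explicit T(const H* hook) : ref(hook) {}
    void SetHook(const H* hook) { ref = hook; }  // hypothetical setter
    std::string M1() const {
        std::string out;
        for (int i = 0; i < 2; ++i) out += ref->M2();
        return out;
    }
private:
    const H* ref;
};

// Usage: the same T object changes behavior when its hook object
// is replaced at run time.
inline std::string demo_connection() {
    H h;
    AltH alt;
    T t(&h);
    std::string first = t.M1();
    t.SetHook(&alt);
    std::string second = t.M1();
    assert(first == "hh");
    return first + second;
}
```

Compared with unification, the variation is selected per object and at run time rather than per subclass at compile time.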


3.3.3 Role of Refactorings for Hot-Spot Analysis


The hot-spot-driven approach provides a comprehensive
method for evolving designs to manage change in both 
data and functionality. Pree notes that "the seven 
composition meta patterns repeatedly occur in 
frameworks." Thus, we expect an ongoing need to add 
meta patterns to evolving architectures. The addition of 
meta patterns is currently a manual process. Conditions 
are checked to ensure that a pattern can be added safely, 
lines of affected source code are identified, changes are 
coded, the system is tested to check for errors, any 
errors are fixed and the system is retested. Retesting 
continues until the expected likelihood of an error is 
sufficiently low. 


This section demonstrates that most meta patterns can 
be viewed as transformations from a simpler design. 
Refactorings automate the transition between designs 
granting designers the freedom to create simple 
frameworks and add patterns as needed when hot-spots 
are identified.


4 Related Work 


Griswold developed behavior-preserving 
transformations for structured programs written in 
Scheme [Gri91]. The goal of this system was to assist 
in the restructuring of functionally decomposed 
software. Software architectures developed using the 
classic structured software design methodology 
[You79] are difficult to restructure because nodes of the 
structure chart which define the program pass both data 
and control information. The presence of control 
information makes it difficult to relocate subtrees of the 
structure chart. As a result, most transformations are 
limited to the level of a function or a block of code. 


Object-oriented software architectures offer greater 
possibilities for restructuring. Bergstein defined a small 
set of object-preserving class transformations which can 
be applied to class diagrams [Ber91]. Lieberherr 
implemented these transformations in the Demeter 
object-oriented software environment [Lie91]. Example 
transformations are deleting useless subclasses and 
moving instance variables between a superclass and a 
subclass. Bergstein’s transformations are object 
preserving so they cannot add, delete, or move methods 
or instance variables exported by a class. 


Banerjee and Kim identified a set of schema
transformations which accounted for many changes to 
evolving object-oriented database schema [Ban87]. 
Opdyke defined a parallel set of behavior-preserving 
transformations for object-oriented applications based 
on the work by Banerjee and Kim, the design principles 
of Johnson and Foote [Joh88], and the design history of 
the UIUC Choices software system [May89]. These 
transformations were termed refactorings. Roberts 
developed the Smalltalk Refactory Browser which 
implements many of these refactorings [Rob97]. 


Tokuda and Batory proposed additional refactorings to
support design patterns as target states for software
restructuring efforts [Tok95]. Refactorings are shown to
support the addition of design patterns to
object-oriented applications [Tok95, Rob97, Sch98].
Winsen used refactorings to make design patterns more
explicit [Win96]. Tokuda and Batory demonstrated that
refactorings can automate significant (greater than 10K
lines of code) changes when applied to real applications
[Tok99].


A number of tools instantiate a design pattern and insert 
it into existing source code [Bud96, Kim96, Flo97]. 
Instantiations are not necessarily refactorings, so testing 
of any changes may be required. Florijn and Meijers
check invariants governing a pattern and repair
violations when possible. Refactorings do not have this
pattern-level knowledge. 


5 Summary 


Architectural evolution is a costly yet unavoidable 
consequence of a successful application. One method 
for reducing cost is to automate aspects of the 
evolutionary cycle when possible. For object-oriented 
applications in particular, there are regular patterns by 
which architectures evolve. Three modes of
architectural evolution are: schema transformations, the
introduction of design pattern microarchitectures, and
the hot-spot-driven approach. Many evolutionary
changes can be viewed as program transformations
which are automatable with object-oriented
refactorings. Refactorings are superior to hand-coding
because they check enabling conditions to ensure that a 
change can be made safely, identify all lines of source 
code affected by a change, and perform all edits. 
Refactorings allow architectural evolution to occur at 
the level of a class diagram and leave the code-level 
details to automation. 


Architectures should evolve on an if-needed basis: 


* "Complex systems that work evolved from simple
systems that worked." (Booch)


* "Start stupid and evolve." (Beck)


Refactorings directly address the need to evolve from 
simple to complex designs by automating many 
common design transitions. We believe that the 
majority of all object-oriented applications undergo
some form of automatable evolution. The broad scope 
of supported changes indicates that refactorings can 
have a significant impact when applied to evolving 
designs. This claim is validated with real applications in 
[Tok99] where many hand-coded changes between two 
major releases of two software systems are automated. 


The limiting factor barring the widespread acceptance
of refactoring technology appears to be the limited
availability of production-quality refactorings for the
two most popular object-oriented languages: C++ and
Java. Our current research identifies implementation
issues for C++ [Tok99].


Abstract


Several reflective architectures have attempted to 
improve meta-object reuse by supporting composi- 
tion of meta-objects, but have done so using limited 
mechanisms such as Chains of Responsibility. We 
advocate the adoption of the Composite pattern to 
define meta-configurations. In the meta-object pro- 
tocol (MOP) of Guarana, a composer meta-object 
can control reconfiguration of its component meta- 
objects and their interactions with base-level ob- 
jects, resolving conflicts that may arise and estab- 
lishing meta-level security policies. 


Guarana is currently implemented as an extension 
of Kaffe OpenVM™, a free implementation of the 
Java¹ Virtual Machine. Nevertheless, most design
decisions presented in this paper can be transported 
to other programming languages and MOPs, im- 
proving their flexibility, reconfigurability, security 
and meta-level code reuse. We present performance 
figures that show that it is possible to introduce 
run-time reflection support in a language like Java 
without much impact on execution speed. 


1 Introduction 


Object-oriented design is based on abstraction and 
information hiding (encapsulation). These concepts 
have provided an effective framework for the man- 
agement of complexity of applications. Within this 
framework, software developers strive to obtain
applications that are highly coherent and loosely
coupled. Unfortunately, object orientation alone does
not address the development of software that can be
easily adapted.


¹ Java is a trademark of Sun Microsystems, Inc.


The concept of open architectures [6, 7] has been
proposed as a partial solution to the problem of
creating software that is not only modular and
well-structured but also easier to adapt. Open
architectures encourage a modular design where there is
a clear separation of policy, that is, what a module
has been designed for, from the mechanisms that 
implement a policy, that is, how a policy is material- 
ized. The implementation of system-oriented mech- 
anisms such as concurrency control, distribution, 
persistence and fault-tolerance can benefit from this 
approach to software construction. 


Computational reflection [13, 21] (henceforth just 
reflection) has been proposed as a solution to the 
problem of creating applications that are able to 
maintain, use and change representations of their 
own designs (structural or behavioral). Reflective 
systems are able to use self-representations to ex- 
tend and adapt their computation. Due to this 
property, they are being used to implement open 
software architectures. In reflective architectures, 
components that deal with the processing of self- 
representation and management of an application 
reside in a software layer called meta-level. Compo- 
nents that deal with the functionality of the applica- 
tion are assigned to a software layer called base-level. 
In object-oriented reflective systems, meta-level ob- 
jects that implement management policies are called 
meta-objects. 


Due to their inherent structure, the existing reflec- 
tive architectures and MOPs may induce developers 
to create complex meta-objects that, in an all-in-one 
approach, implement many management aspects of 
an application or, alternatively, to construct
coherent but tightly coupled meta-objects. Both
alternatives make reuse, maintenance and adaptation of an
application harder, especially of its meta-level, the 
layer in which most of the adaptations tend to occur 
in an open architecture. 


In contrast, Guarana [20] allows meta-objects to 
be combined through the use of composers. Com- 
posers [17] are meta-objects that can be used to de- 
fine arbitrary policies for delegating control to other 
meta-objects, including other composers. They pro- 
vide the glue code to combine meta-objects, and to 
resolve conflicts between incompatible ones. The 
use of composers encourages the separation of the 
structure of the meta level from the implementation 
of individual management aspects. 


Our implementation of Guarana, based on a Java 
interpreter that supports just-in-time compilation, 
has shown that it is possible to introduce intercep- 
tion mechanisms, essential for the deployment of 
behavioral reflection, with a small overhead. We 
believe that this overhead is a minor drawback, 
when compared with the flexibility introduced by 
our MOP. 


This paper is structured as follows. In the next sec- 
tion, we discuss some related works. In Section 3, 
we present the reflective architecture of Guarana. 
Section 4 contains a short description of our imple- 
mentation of this architecture, extending a freely- 
available Java Virtual Machine. In Section 5, we 
present some figures about the impact of Guarana 
on the performance of applications. Section 6 lists 
some possible future optimizations for our imple- 
mentation of Guarana. Finally, in Section 7, we 
summarize the main points of the paper. 


2 Related Work 


The development of generic mechanisms for the 
composition of meta-objects is still in its initial 
stages. OpenC++ [2] does not provide direct 
support for composition. MOOSTRAP [16] and 
MetaXa [9] (formerly known as MetaJava) support 
sequential composition of similar meta-objects. We 
say that meta-objects are similar if they implement 
the same interface. 


Apertos [23] and CodA [14] assign aspects of
base-level execution, such as sending, receiving and
scheduling operations, to specialized, dissimilar
meta-objects. A pre-determined set of aspects can
be extended through intrusive modification of the
implementation of the meta-objects responsible for
them. We consider this a primitive mechanism of
composition that fails in the general case, because
the modifications are very likely to clash.


When a Client requests an operation Op from
a Server object, the operation is intercepted,
reified (represented as an object) and presented
to a Meta-Object. It may choose to deliver a
different operation Op' to the Server, obtaining
the result R', that is also reified. Having
delivered an operation or not, it must reply
with a result R, that is unreified and returned
to the Client.


Figure 1: Basic interception.


Several run-time MOPs have been designed so that, 
when a meta-object is requested to handle a reified 
operation (for example, a method invocation), it is 
obliged, by the design of the MOP, to return a valid 
result for the operation (typically the value returned 
by the method), as shown in Figure 1. The meta-level
computation that yields the result may or may not
include the delivery of the operation to the base-level
object.


This design implies that the only way to combine
the behavior of meta-objects is by arranging for one
meta-object, say MO1, to forward operation handling
requests to another, say MO2, delegating to
MO2 the responsibility for computing the result of
the operation. Only after MO2 returns a result will
MO1 be able to observe and/or to modify it.


Given such a protocol, meta-objects are likely to 
be organized in a Chain of Responsibility [5, chap- 
ter 5], so that each meta-object delegates operation 
handling requests to its successor, as depicted in 
Figure 2. The last element of the chain is either the 
base-level object [9] or a special meta-object that 
delivers operations to it [16]. We argue that this 
design presents some serious drawbacks: 


* it is intrusive upon the meta-object implementation,
in the sense that a meta-object must
explicitly forward operations to its successor;


* it forbids multiple meta-objects from concurrently
handling the same operation, because,
at a given moment, at most one meta-object
can be responsible for producing a result or
delivering the operation to the base level;


* it forces meta-objects to receive the results of
operations they handled, even if they are not
interested in them;


* the order of presentation of results is necessarily
the reverse order of the reception of operations,
even though different (possibly concurrent)
orderings might be more appropriate or
efficient, according to the semantics and the
requirements of the application;


* it is impossible to mediate interactions between
meta-objects and base-level objects with
an adaptor capable of resolving conflicts that
might arise when multiple meta-objects are put
to work together.


Given the basic interception mechanism of
Figure 1, meta-objects can only be composed
with a Chain of Responsibility [5, chapter 5],
a sequential delegation pattern.


Figure 2: Chain of Meta-Objects.


Even AspectJ [11, 12], an aspect-oriented program- 
ming [8] extension of Java, lacks the possibility of 
introducing such an adaptor to manage conflicting 
weaves of aspects so that they can coexist. 


3 The Reflective Architecture of 
Guarana 


The problems presented at the end of Section 2 are
solved in the MOP of Guarana by splitting the
meta-level processing associated with a base-level
operation into the following steps:
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3(f) perform operation 


This figure presents the basic MOP of 
Guarand: although a meta-object is allowed 
return a result when requested to handle an op- 
eration (a), it may prefer to return an opera- 
tion to be performed (b), with or without an in- 
dication that it is interested in its result (c). If 
it is, it will be presented the result after the ex- 
ecution of the operation (d). Meta-objects can 
use operation factories to create operations (e) 
that can replace other operations (b,c) or be 
performed as stand-alone ones (f). 


Figure 3: Operations and Results. 


1. If the target object of the operation is associated
with a meta-object, the kernel of Guarana (the
entity that implements the MOP) intercepts and
reifies the operation and requests the meta-object
to handle it; otherwise, no meta-level computation
occurs, reducing the overhead for non-reflective
objects.


2. A meta-object may produce a result for an op- 
eration, as in Figure 3(a). In this case, the 
meta-level processing terminates by unreifying 
the result as if it had been produced by the 
execution of the intercepted operation. 


3. However, the meta-object is not required to reply
with a result. This is essential because a
meta-object cannot itself deliver the operation to
the base-level object. Instead, it should reply
with an operation to be delivered to the base level
(Figure 3(b)), usually the operation it was
requested to handle, and with an indication of
whether it is interested in observing and/or
modifying the result of the operation
(Figure 3(c)).


4. Finally, the operation is delivered to the base
level, and its result may or may not be presented
to the meta-object, depending on its previous reply
(Figure 3(d)). If it had requested permission to
modify the result, it may now reply with a
different result for the operation.


Replacement operations can be created in the meta- 
level using operation factories, as in Figure 3(e). 
Operation factories allow meta-objects to obtain 
privileged access to the base-level objects they man- 
age. Stand-alone operations can also be created 
with operation factories, and then performed, i.e., 
submitted for interception, meta-level processing 
and potential delivery for base-level execution, as 
in Figure 3(f). 
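

The four steps above can be illustrated with a minimal Java sketch. All names here (MopSketch, Operation, Reply, handleOperation, handleResult, perform) are assumptions made for this illustration and are not taken from the Guarana sources; the sketch also merges "observe" and "modify" into a single result handler that returns the result to be used.

```java
import java.util.function.Supplier;

/** Hypothetical sketch of the four steps above; these are not Guarana's real classes. */
final class MopSketch {

    /** A reified operation; body.get() stands in for execution at the base level. */
    static final class Operation {
        final String name;
        final Supplier<Object> body;

        Operation(String name, Supplier<Object> body) {
            this.name = name;
            this.body = body;
        }
    }

    /** A meta-object's answer to a request to handle an operation. */
    static final class Reply {
        final boolean isResult;     // true: 'result' ends the processing (Figure 3(a))
        final Object result;
        final Operation operation;  // operation to deliver instead (Figure 3(b))
        final boolean wantsResult;  // interest in the outcome (Figure 3(c))

        private Reply(boolean isResult, Object result, Operation operation, boolean wantsResult) {
            this.isResult = isResult;
            this.result = result;
            this.operation = operation;
            this.wantsResult = wantsResult;
        }

        static Reply result(Object value) {
            return new Reply(true, value, null, false);
        }

        static Reply deliver(Operation operation, boolean wantsResult) {
            return new Reply(false, null, operation, wantsResult);
        }
    }

    interface MetaObject {
        Reply handleOperation(Operation operation);

        /** Called only when interest was declared; returns the result to use. */
        Object handleResult(Operation operation, Object result);
    }

    /** Plays the role of the kernel for a single operation. */
    static Object perform(MetaObject primary, Operation operation) {
        if (primary == null) {
            return operation.body.get();              // step 1: object is not reflective
        }
        Reply reply = primary.handleOperation(operation);
        if (reply.isResult) {
            return reply.result;                      // step 2: meta-object supplied a result
        }
        Object result = reply.operation.body.get();   // steps 3 and 4: deliver to the base level
        return reply.wantsResult
                ? primary.handleResult(reply.operation, result)
                : result;
    }
}
```

In this sketch a meta-object that only wants to watch results would return the result unchanged from handleResult, while one that replaces the operation would pass a different Operation to Reply.deliver.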


We have been able to define composers by separating
operation handling from result handling, implemented
in two distinct methods, namely, handle operation
and handle result. A composer is a meta-object that
delegates operations and results to multiple
meta-objects, then composes their replies into its
own replies. For example, a composer can implement
the chain of meta-objects presented before, but in
a way that one meta-object does not have to keep
track of its successor. Another implementation of
composer may delegate operations and/or results
concurrently to multiple meta-objects, or refrain
from delegating an operation to some meta-objects
if it is aware that they are not interested in that
operation.
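

As an illustration of the first kind of composer, the following sketch delegates operations in order and results in reverse order, so that no component needs to know its successor. It is a simplified assumption, not Guarana's implementation: operations are plain strings, every component is shown every result, and the names ComposerSketch, SequentialComposer and Tracer are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a sequential composer; names are illustrative only. */
final class ComposerSketch {

    interface MetaObject {
        /** Returns the operation to pass on (possibly a replacement). */
        String handleOperation(String operation);

        /** Returns the result to pass on (possibly modified). */
        Object handleResult(String operation, Object result);
    }

    /** Delegates operations front to back and results back to front. */
    static final class SequentialComposer implements MetaObject {
        private final List<MetaObject> components;

        SequentialComposer(MetaObject... components) {
            this.components = Arrays.asList(components);
        }

        @Override
        public String handleOperation(String operation) {
            for (MetaObject component : components) {
                operation = component.handleOperation(operation);
            }
            return operation;
        }

        @Override
        public Object handleResult(String operation, Object result) {
            for (int i = components.size() - 1; i >= 0; i--) {
                result = components.get(i).handleResult(operation, result);
            }
            return result;
        }
    }

    /** A component that records what it sees and changes nothing. */
    static final class Tracer implements MetaObject {
        private final String name;
        private final List<String> log;

        Tracer(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        @Override
        public String handleOperation(String operation) {
            log.add(name + " handles " + operation);
            return operation;
        }

        @Override
        public Object handleResult(String operation, Object result) {
            log.add(name + " sees " + result);
            return result;
        }
    }

    /** Runs one operation and one result through a two-element chain. */
    static List<String> demo() {
        List<String> log = new ArrayList<>();
        MetaObject chain = new SequentialComposer(new Tracer("MO1", log), new Tracer("MO2", log));
        String operation = chain.handleOperation("getName");
        chain.handleResult(operation, "Guarana");
        return log;
    }
}
```

A concurrent or filtering composer would differ only in the bodies of the two handler methods, which is the flexibility the separation of operation and result handling is meant to provide.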


In Guarana, at any given moment, each object can 
be directly associated with at most one meta-object, 
called its primary meta-object. If there is no such 
association, operations addressed to that object are 
not intercepted, and we say that the object is not 
reflective at that moment. 


The fact that Guarana associates a single (primary)
meta-object with an object keeps the design of the
interception mechanism simple. Since the primary
meta-object can be a composer, as can any
meta-object it delegates to, multiple meta-objects
can reflect upon an object. These meta-objects form
a Composite pattern [5, chapter 4] that we call the
meta-configuration of that object (Figure 4), a
potentially infinite hierarchy of composition that
is orthogonal to the well-known infinite tower of
meta-objects [13].


The meta-configuration of O1 is elaborate: a
composer, called its primary meta-object, delegates
to three other meta-objects, one of which is a
composer itself, and delegates to two other
meta-objects. O2 is not associated with any
meta-object, so its operations are not intercepted;
it is not reflective. O3 shares MO with O1. MO1 is
a reflective meta-object, since it has its own
(meta-)meta-configuration.


Figure 4: Meta-configurations. 


3.1 Meta-configuration management 


Guarana presents two additional features that en- 
force the separation of concerns between the base 
level and the meta level: (i) the meta-configuration
of an object is completely hidden from the base level 
and even from the meta level itself; and (ii) the ini- 
tial meta-configuration of an object is determined 
by the meta-configurations of its creator and of its 
class, a mechanism we call meta-configuration prop- 
agation. 


The first design decision implies that there is no way 
to find out what is the primary meta-object associ- 
ated with an object. It is possible, however, to send 
arbitrary messages and reconfiguration requests to 
the components of the meta-configuration of an ob- 
ject, through the kernel of Guarana. 


Any object M (for message) can be sent to the
primary meta-object of an object O. Composers
usually forward messages to their components. For
non-reflective objects, this request is ignored.


Figure 5: Broadcasting a message.


Guarana.reconfigure(O, MO3, MOg);


A request to replace MO3 with MOg in the
meta-configuration of O was issued. As the request
descends the composition hierarchy, it reaches the
target meta-object. In this case, it agrees to be
replaced, by returning the proposed meta-object. A
meta-object must return itself in order to ignore
the request, as C1 does; otherwise the returned
meta-object will replace it.


Figure 6: Dynamic reconfiguration.


Messages can be used to extend the MOP of Guarana,
as they allow meta-objects to exchange information
even if they do not hold references to each other.
Meta-objects that do not understand a message are
supposed to ignore it, and composers are expected
to forward messages to their components, as in
Figure 5. The kernel operation that implements this
mechanism is called broadcast.
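

The broadcast mechanism can be sketched as follows. This is only an illustration under stated assumptions: the Kernel class, its setPrimary method and the handleMessage method name are invented for the example, and the real kernel does not expose a public way to set the primary meta-object.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of message broadcasting; names are illustrative only. */
final class BroadcastSketch {

    interface MetaObject {
        void handleMessage(Object message);
    }

    /** A composer forwards every message to its components. */
    static final class Composer implements MetaObject {
        private final List<MetaObject> components;

        Composer(MetaObject... components) {
            this.components = Arrays.asList(components);
        }

        @Override
        public void handleMessage(Object message) {
            for (MetaObject component : components) {
                component.handleMessage(message);
            }
        }
    }

    /** Understands only String messages and silently ignores all others. */
    static final class Listener implements MetaObject {
        final List<String> received = new ArrayList<>();

        @Override
        public void handleMessage(Object message) {
            if (message instanceof String) {
                received.add((String) message);
            }
        }
    }

    /** Stand-in for the kernel: maps reflective objects to their primary meta-object. */
    static final class Kernel {
        private final Map<Object, MetaObject> primary = new IdentityHashMap<>();

        void setPrimary(Object target, MetaObject metaObject) {
            primary.put(target, metaObject);
        }

        /** Senders need no reference to any meta-object, only to the base-level object. */
        void broadcast(Object target, Object message) {
            MetaObject metaObject = primary.get(target);
            if (metaObject != null) {       // non-reflective objects: request is ignored
                metaObject.handleMessage(message);
            }
        }
    }
}
```

Because any object can serve as a message, new protocols between meta-objects can be introduced by defining new message classes, without changing the MetaObject interface.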


A reconfiguration request (Figure 6) carries a pair
of meta-objects, suggesting that the first
meta-object (MO3) should be replaced with the second
(MOg) in the meta-configuration of object O. A
special value (null) can be used to refer to the
primary meta-object. It is up to the existing
meta-configuration to decide whether the request is
acceptable or not. However, if the base-level object
is not reflective, an InstanceReconfigure message is
broadcast to the meta-configurations of its class
and of its superclasses, as depicted in Figure 7.
Their components can modify the suggested
meta-configuration, for example, forcing it to
remain empty.


Guarana.reconfigure(Object, null, MO);


The null meta-object can be used as an alias for
the primary meta-object in reconfiguration requests.
When the object is not reflective, the
meta-configuration of its class will be given the
opportunity to affect the proposed
meta-configuration of the instance. An
InstanceReconfigure message will carry the proposed
meta-object, so that meta-objects of the object's
class(es) may modify it. The remaining meta-object
will become the object's primary meta-object.


Figure 7: Reconfiguration of a non-reflective object.
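

The reconfiguration protocol can be sketched in Java as below. The sketch is an assumption-laden simplification: class and method names are invented, the InstanceReconfigure broadcast to the class's meta-configuration is omitted for non-reflective objects, and peekPrimary exists only so that the example can be checked (the paper's kernel deliberately hides the primary meta-object).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of dynamic reconfiguration; names are illustrative only. */
final class ReconfigureSketch {

    interface MetaObject {
        /** Returns the meta-object that should occupy this one's place. */
        MetaObject reconfigure(MetaObject oldMo, MetaObject newMo);
    }

    /** Agrees to be replaced when it is the target; otherwise returns itself. */
    static final class Leaf implements MetaObject {
        final String name;

        Leaf(String name) {
            this.name = name;
        }

        @Override
        public MetaObject reconfigure(MetaObject oldMo, MetaObject newMo) {
            return oldMo == this ? newMo : this;
        }
    }

    /** Passes the request down to its components and keeps its own place. */
    static final class Composer implements MetaObject {
        final List<MetaObject> components;

        Composer(MetaObject... components) {
            this.components = new ArrayList<>(Arrays.asList(components));
        }

        @Override
        public MetaObject reconfigure(MetaObject oldMo, MetaObject newMo) {
            List<MetaObject> kept = new ArrayList<>();
            for (MetaObject component : components) {
                MetaObject replacement = component.reconfigure(oldMo, newMo);
                if (replacement != null) {
                    kept.add(replacement);
                }
            }
            components.clear();
            components.addAll(kept);
            return this;    // returning itself means "do not replace me"
        }
    }

    /** Stand-in for the kernel. */
    static final class Kernel {
        private final Map<Object, MetaObject> primary = new IdentityHashMap<>();

        /** For checking the sketch only; the real kernel hides this information. */
        MetaObject peekPrimary(Object target) {
            return primary.get(target);
        }

        void reconfigure(Object target, MetaObject oldMo, MetaObject newMo) {
            MetaObject current = primary.get(target);
            if (current == null) {
                // Non-reflective target: the class-level InstanceReconfigure step is omitted here.
                if (newMo != null) {
                    primary.put(target, newMo);
                }
                return;
            }
            // null is an alias for the primary meta-object.
            MetaObject effectiveOld = (oldMo == null) ? current : oldMo;
            MetaObject replacement = current.reconfigure(effectiveOld, newMo);
            if (replacement == null) {
                primary.remove(target);
            } else {
                primary.put(target, replacement);
            }
        }
    }
}
```

Note that the decision always rests with the existing meta-configuration: the requester only proposes a replacement, and a composer such as the one above can refuse to give up its own place.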


In most object-oriented programming languages, 
creating an object consists of two steps: (i) allocat- 
ing storage for the object, possibly initialized with 
default values, then (ii) invoking its constructor. We 
say that these steps are performed by the creator of 
the object. 


Meta-configuration propagation takes place between 
these two steps in Guarana. The primary meta- 
object of the creator is responsible for providing a 
meta-object for the new object. It may return null, 
a different meta-object or even itself, as a meta- 
object can belong to multiple meta-configurations. 
A composer is expected to forward this request to 
its components and to create a composer that del- 
egates to the meta-objects returned by them, as in 
Figure 8. 


When a reflective object instantiates another
object, its meta-configuration may propagate to the
new object before the object is initialized. In
fact, the meta-configuration does not have to
propagate as a whole: in the picture, only MO3 was
effectively propagated; MO2 was discarded, whereas
MO1 named MO4 to occupy its place in the
meta-configuration of the new object. C1 created a
new composer to delegate to MO4 and MO3.


Figure 8: Meta-configuration propagation.


After meta-configuration propagation, the kernel of
Guarana broadcasts a NewObject message to the
meta-configuration of the class of the new object,
so that its meta-objects can try to reconfigure it,
as shown in Figure 9. Finally, the object is
constructed, but the constructor invocation will be
intercepted if the new object has become reflective.
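

Meta-configuration propagation, as described above and pictured in Figure 8, can be sketched as follows. The method name configure and the classes Propagating, Discarded and Delegating are assumptions made for this example; they only model the three behaviours the text mentions (returning itself, returning null, returning a different meta-object).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of meta-configuration propagation; names are illustrative only. */
final class PropagationSketch {

    interface MetaObject {
        /** Returns the meta-object to be used for newObject, or null for none. */
        MetaObject configure(Object newObject);
    }

    /** Propagates itself: it will belong to both meta-configurations. */
    static final class Propagating implements MetaObject {
        @Override
        public MetaObject configure(Object newObject) {
            return this;
        }
    }

    /** Does not propagate to new objects. */
    static final class Discarded implements MetaObject {
        @Override
        public MetaObject configure(Object newObject) {
            return null;
        }
    }

    /** Names a different meta-object to occupy its place in the new object. */
    static final class Delegating implements MetaObject {
        private final MetaObject substitute;

        Delegating(MetaObject substitute) {
            this.substitute = substitute;
        }

        @Override
        public MetaObject configure(Object newObject) {
            return substitute;
        }
    }

    /** Forwards the request and builds a new composer from the answers. */
    static final class Composer implements MetaObject {
        final List<MetaObject> components;

        Composer(List<MetaObject> components) {
            this.components = new ArrayList<>(components);
        }

        @Override
        public MetaObject configure(Object newObject) {
            List<MetaObject> propagated = new ArrayList<>();
            for (MetaObject component : components) {
                MetaObject metaObject = component.configure(newObject);
                if (metaObject != null) {
                    propagated.add(metaObject);
                }
            }
            // An empty answer leaves the new object non-reflective.
            return propagated.isEmpty() ? null : new Composer(propagated);
        }
    }
}
```

In the real system the kernel would call the creator's primary meta-object between storage allocation and constructor invocation; the sketch shows only how a composer assembles the new meta-configuration.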


3.2 Support for proxy objects 


Guarana provides a mechanism that allows proxy
objects to be created from the meta level, without 
invoking their constructors. In addition to the tra- 
ditional use of a proxy, namely, for representing an 
object from another address space, a proxy can be 
used to reincarnate an object from persistent stor- 
age, to migrate an object, etc. 


When a proxy is created, as in Figure 10, the kernel 
of Guarana broadcasts to the meta-configuration 
of its class a NewProxy message, a subclass of 
NewObject. A proxy will usually be given a
meta-configuration that prevents operations from
reaching it, but it may be transformed into a real
object by its meta-configuration, through
constructor invocation or direct initialization.


After meta-configuration propagation, the
meta-configuration of the class of a new object is
notified about the new instance, with a NewObject
message, so that it can try to affect the
meta-configuration of its instances, by issuing
reconfiguration requests.


Figure 9: NewObject messages. 


Object = Guarana.makeProxy(Class, MOp);
reconfigure(Object, null, MOp);


It is possible to request the creation of a proxy
object of any class. As soon as the proxy is
created, a NewProxy message, a subclass of
NewObject, is broadcast to the meta-configuration
of the class, so that it can take control over the
proxy before the proposed meta-object does.
Afterwards, a reconfigure request is automatically
issued to try to install the proposed meta-object
as the primary meta-object of the proxy.





Figure 10: Proxy objects. 


3.3 Security 


Another advantage of the MOP of Guarana is its 
concern with security. The hierarchy of composi- 
tion can be used to limit the ability of a meta-object 
to affect a base-level object. For example, a com- 
poser may decide not to present an operation to a 
meta-object, or to ignore results or replacement op- 
erations it produces. The composer can withhold a 
message to a component, reject a meta-object pro- 
duced by a component at a reconfiguration or prop- 
agation request, or provide restrictive operation fac- 
tories to its components, thus limiting their ability 
to create operations. Furthermore, since the
identity of the primary meta-object of an object is
not exposed, the hierarchy cannot be subverted.


4 Implementation


We had originally intended to implement Guarana 
in 100% Pure Java, either by writing an extended 
Java interpreter in Java or by introducing intercep- 
tion mechanisms through a bytecode preprocessor. 
The first alternative was discarded because it could 
imply poor performance and difficulties in handling 
native methods [22]. A bytecode preprocessor im- 
plementation was not possible either, due to restric- 
tions imposed by the Java bytecode verifier [10] and 
the impossibility to rename native methods, needed 
in order to ensure their interception. 


Therefore, we have decided to implement Guarana 
by modifying the Kaffe OpenVM™, an open-source 
Java Virtual Machine. Most of Guarana is coded 
in Java, but the Java Virtual Machine has undergone
a very minor and localized modification, in order 
to provide for interception of operations. The per- 
formance impact due to the modification was quite 
small (Section 5) especially when compared to the 
benefit of transparent interception of method invo- 
cations, field and array accesses, object instantia- 
tion, and monitor primitives. 


The Java Programming Language, however, has not 
been modified. Thus, any Java program, compiled 
with any Java compiler, will run on our implementa- 
tion, within the limitations of the Kaffe OpenVM, 
the most portable existing Java Virtual Machine. 
We consider this aspect of Guarana yet another 
benefit of our approach as programmers will be 
able to use the reflective mechanisms provided to 
adapt Java programs originally implemented in the 
absence of any concern with reflection, even with- 
out access to the program’s source code. This is 
possible by starting a meta-application to set up 
meta-configurations of application classes and ob- 
jects before the application runs. Then, the meta- 
application starts the application, but it can still 
control it through interception, meta-configuration 
propagation and instance reconfiguration messages. 
Guarana also provides probe meta-objects that can 
be helpful for figuring out the behavior of certain 
objects, so that they can be properly configured. 


The MOP of Guarana can also be implemented 
in other object-oriented programming languages, or 
even upon existing reflective platforms, as an
extension to their built-in MOPs. However, some
particular features of Guarana may be difficult to
duplicate, if some design decisions for the target
language or MOP conflict with those of Guarana.


Java 1.1 was an excellent choice as a target language 
for Guarana, because it already provides some re- 
flective properties, such as the ability to represent 
classes, methods and fields as objects (i.e., these el- 
ements of the language are reified), so that it is pos- 
sible to navigate a class hierarchy (introspection) 
and even interact with objects using the Java Core 
Reflection API to reflectively invoke methods and 
to get or set the value of fields. However, such in- 
teractions are restricted by the language access con- 
trol rules, mimicked at run-time. In Java 2, access 
control can be suppressed for particular instances of
Methods and Fields, allowing an instance of a class
that is able to perform the access to supply privi- 
leged access to other objects. Other than that, the 
Reflection API allows an object to perform only the 
operations that it would have been allowed to per- 
form directly in source code, i.e., access control is 
based on class permissions. 
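

The reflective facilities mentioned above can be illustrated with the standard java.lang.reflect API. The Account class and the demo method are invented for this example; the calls getMethod, invoke, getDeclaredField, setAccessible and setInt belong to the Java Core Reflection API.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;

/** Example class with a private field, used only for this illustration. */
final class Account {
    private int balance = 10;

    public int getBalance() {
        return balance;
    }
}

final class CoreReflectionSketch {

    /** Returns the balance before and after a reflective write to the private field. */
    static int[] demo() {
        try {
            Account account = new Account();

            // Introspection plus reflective invocation of a public method.
            Method getter = Account.class.getMethod("getBalance");
            int before = (Integer) getter.invoke(account);

            // The private field is protected by the language access rules;
            // setAccessible(true) suppresses the check for this Field instance.
            Field balance = Account.class.getDeclaredField("balance");
            balance.setAccessible(true);
            balance.setInt(account, 42);

            int after = (Integer) getter.invoke(account);
            return new int[] { before, after };
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Whoever holds the accessible Field instance can hand it to other objects, which is the per-instance form of privileged access that the paragraph above describes; nothing in this API, however, allows operations to be intercepted.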


Guarana builds upon these features, introducing 
mechanisms for interception, that are missing in 
Java, and per-object (as opposed to per-class) se- 
curity mechanisms, so that meta-objects can obtain 
privileged access to objects they control. 


5 Performance 


We have run some performance tests to try to eval- 
uate the impact of introducing reflective capabilities 
into a Java interpreter. Like the other few papers in 
the literature on reflection that provide performance 
data, we have preferred to evaluate the overhead of 
reflection on each particular operation, instead of 
running standard benchmarks. In fact, there are no 
standard benchmarks to evaluate the impact of re- 
flection. Existing general-purpose benchmarks usu- 
ally focus on optimization of complex patterns of 
control flow, which would not be affected by the 
introduction of interception for object operations,
and calculations on large arrays, which would incur 
a huge overhead. 


Our tests have been performed on four different
platforms, listed in Table 1. On the Solaris
platforms, the tests were run in real-time
scheduling mode, so as to ensure that no other
processes would affect the measured times. On the
GNU/Linux platforms, this scheduling mechanism was
not available, so we just ensured that the tested
hosts were as lightly loaded as possible.


Table 1: Description of the platforms.


This table describes the platforms on which the
performance tests were run.


i586 | 100 MHz Pentium running RedHat Linux 5.1
i686 | 233 MHz Pentium Pro running RedHat Linux 5.0
spu1 | 167 MHz SPARC Ultra 1 running Solaris 2.6
spu2 | 200 MHz SPARC Ultra Enterprise 2 running Solaris 2.5


On each host, we have run the same Java program, 
compiled with Sun JDK’s Java compiler, without 
optimization, to prevent method inlining. The pro- 
duced bytecodes were executed by different inter- 
preters under different configurations. 


We have used Guarana 1.4.1 and the snapshot 
of Kaffe 1.0.b1 distributed with it, using the JIT 
compiler and the interpreter engines. Kaffe and 
Guarana were compiled with EGCS 1.1b, with 
default optimization levels. The program used to 
perform the tests was the one distributed with 
Guarana 1.4.1. 


For each configuration, we have timed several differ- 
ent operations, described in Table 2. Each operation 
was timed by running it repeatedly inside a loop, af- 
ter running it once outside the loop, before starting 
the timer. This ensures that, before the loop starts, 
any JIT compilation has already taken place, all the
data and code have been brought into the cache and,
unless the test involves object allocation, the
garbage collector will not run.


This inner loop is run repeatedly, with the
iteration count being adjusted at every outer
iteration, aiming at a running time longer than
1 second. Since the operations that read the clock
at the beginning and at the end of each inner loop
take less than 1 microsecond to run, and the clock
resolution is 1 millisecond, a total running time
of 1 second is enough to eliminate any effects they
might have on the outcome of the tests.


Table 2: Description of the tests.


This table describes the operation(s) performed
within a loop in our performance tests.


emptyloop | No reflective operation.
synchronized | Empty block synchronized on an arbitrary object.
invokestatic | Invoke an empty static method that takes no arguments and returns void.
invokespecial | Invoke a non-static private do-nothing method that returns void and takes only the implicit this as argument. The same bytecode is used to invoke constructors and, in some cases, final methods.
invokevirtual | Invoke an empty method that takes only the implicit this as argument, and returns void. Dynamic binding, performed with a dispatch table, occurs before the interception test.
invokeinterface | Invoke the same method, but through an object reference of interface type. Dynamic binding is much slower in this case.
getstatic | Load a static int field into a variable.
putstatic | Store a zero-valued variable in a static int field.
getfield | Load a non-static int field into a variable.
putfield | Store a zero-valued variable in a non-static int field.
arraylength | Load the length of an array of int into a variable.
iaload | Load the first element of an array of int into a variable.
iastore | Store a zero-initialized variable in the first element of an array of int.
println | Print the line "Hello world!" to System.err, which was redirected to /dev/null before starting the Virtual Machine. It is a first attempt to estimate the overall impact of introducing interception abilities.
compile | Compile the test program itself. Section 5.1 contains a detailed description and analysis.


Table 3: Overhead on interpreter.


No interception occurs in these tests; they just
measure the overhead imposed on the interpreter to
introduce the ability to intercept operations.


The inner-loop iteration count starts at 1, and is
repeatedly multiplied by 10 until it is large enough
to be measurable with the clock resolution. As soon
as this happens, the elapsed time and the iteration
count start to be used to estimate the running time
of an iteration. If the total elapsed time of an
execution of the inner loop is longer than one
second, the estimate is the final result of the
test. Otherwise, it is used to compute the iteration
count for the next execution of the inner loop,
aiming at a total execution time of 1100
milliseconds.
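

The calibration procedure just described can be sketched as follows. This is a reconstruction from the prose, not the test program distributed with Guarana; the names CalibratedTimer, TimedLoop and measure are assumptions, and the timed loop is passed in as a function from an iteration count to elapsed milliseconds so that the logic can be exercised with a simulated clock.

```java
/** Hypothetical sketch of the iteration-count calibration; names are illustrative only. */
final class CalibratedTimer {

    /** Runs the inner loop 'count' times and returns the elapsed time in milliseconds. */
    interface TimedLoop {
        long run(long count);
    }

    static final long ENOUGH_MILLIS = 1000;   // accept runs longer than one second
    static final long TARGET_MILLIS = 1100;   // aim slightly above the threshold

    /** Returns the estimated running time of one iteration, in milliseconds. */
    static double measure(TimedLoop loop) {
        long count = 1;
        while (true) {
            long elapsed = loop.run(count);
            if (elapsed <= 0) {
                count *= 10;                   // not yet measurable with a 1 ms clock
                continue;
            }
            double perIteration = (double) elapsed / count;
            if (elapsed > ENOUGH_MILLIS) {
                return perIteration;           // long enough: this estimate is final
            }
            // Use the estimate to choose a count expected to take about 1100 ms.
            long next = (long) Math.ceil(TARGET_MILLIS / perIteration);
            count = Math.max(count + 1, next);
        }
    }
}
```

A real caller would implement TimedLoop by reading System.currentTimeMillis() immediately before and after a loop that performs the operation under test, after having run the operation once outside the loop.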


With the exception of the tests println and
compile, this mechanism selected an iteration count
between 50,000 and 100,000,000 for the final
execution of the inner loop of each test. In the
case of println, the iteration count was never
smaller than 500. The compile test was run
stand-alone, not within this framework.


Each test case was run 50 times on each
configuration and platform, and the average times
of the runs were used to compute the relative
overheads presented in Table 3 and Table 4. Although
we have introduced the ability to intercept
operations, no actual interception took place
during those tests.


Table 4: Overhead on JIT compiler.


No interception occurs in these tests; they just
measure the overhead imposed on the JIT compiler
and the code it produces to introduce the ability
to intercept operations.


Operation | i586 | i686 | spu1 | spu2
emptyloop | +0% | +1% | +0% | +0%
synchronized | +12% | +10% | +27% | +3%
invokestatic | +91% | +20% | +23% | +34%
invokespecial | +119% | +8% | +19% | +28%
invokevirtual | +30% | +158% | -6% | +0%
invokeinterface | +7% | +2% | +3% | +2%
getstatic | +68% | +148% | +163% | +163%
putstatic | +180% | +97% | +90% | +90%
getfield | +293% | +86% | +149% | +149%
putfield | +103% | +96% | +66% | +66%
arraylength | +258% | +86% | +140% | +150%
iaload | +191% | +98% | +55% | +95%
iastore | +236% | +55% | +41% | +45%
println | +45% | +6% | +5% | +12%
compile | +36% | +42% | +32% | +29%
compile-JIT | +105% | +112% | +81% | +54%
compile-diff | +16% | +17% | +20% | +20%


Table 5: Total compile time.


These are the total execution times of the compile
test for each configuration. They were used to
calculate the lines compile in Table 3 and Table 4.

(times are in seconds)


Configuration | i586 | i686 | spu1 | spu2
Kaffe JIT | 17 | 5.1 | 9.1 | 7.5
Guarana JIT | 23 | | 12 | 9.6
Kaffe interpreter | 30 | | 13 | 11
Guarana interpreter | 32 | | 13 | 10





5.1 The compile test 


As an additional effort to measure the performance 
impact of the introduction of interception ability, 
we have measured the execution time for the Java 
compiler to translate the test program to Java byte- 
codes. The averaged execution times are presented 
in Table 5. 


On short-running applications like this, most of
the time is spent on virtual machine initialization
and JIT compilation, not on running the application
itself. The virtual machine start-up, for example,
involves executing very large array initialization
methods, whose JIT-compilation wastes a lot of
memory and CPU cycles, because these methods are
executed only once.


Table 6: JIT compilation time for compile test.


These are the times spent on JIT compilation during
the execution of the compile test. They were used
to compute the values in the compile-JIT line of
Table 4.

(times are in seconds)


Table 7: Net compile time.


These are the differences between total execution
time (compile) and JIT compilation time
(compile-JIT), i.e., the times spent on execution
of the JIT compiled code. They were used to compute
the values in the compile-diff line of Table 4.

(times are in seconds)


Configuration | i586 | i686 | spu1 | spu2
Kaffe JIT | 13 | 3.8 | 7.3 | 5.5
Guarana JIT | 16 | 4.5 | 8.8 | 6.7


Although a complex program, involving several sim- 
ilar classes, is being compiled, Table 6 shows that 
more than 50% of the total time was spent on JIT- 
compiling Java Core classes and the Java compiler 
itself. Therefore, the actual overhead in execution 
time, at least for long-running applications, is much 
smaller. 


Table 7 presents the differences between the total
time and the JIT-compilation time, which represent
the time spent on running the actual application,
i.e., the compiler. Long-running applications that
repeatedly execute the same methods should present
a reflection overhead similar to the relative
overhead computed from this table.


Table 8: Interception time, interpreter.


This table presents the interception time of
various operations in the Guarana interpreter, with
a do-nothing meta-object. Field operations refers
to static and non-static field reads and writes.
Array operations involve array length reads and
array element reads and writes.

(times are in milliseconds)


Table 9: Interception time, JIT compiler.


This table presents the interception time of
various operations in the Guarana JIT compiler,
with a do-nothing meta-object. Other operations
refers to all field and array operations.

(times are in milliseconds)

5.2 Intercepting operations 


We have also performed some tests involving actual 
interception, using a do-nothing meta-object to in- 
tercept the operation that is the subject of each 
test. The absolute time spent on the interception 
of a single operation is presented in Table 8, for the 
interpreter, and in Table 9, for the JIT compiler. 


It is worth noting that each synchronized block 
involves two operations, one that enters the monitor 
of an object and another that leaves it. Since both 
are intercepted, the interception time is increased. 
Additional details are available elsewhere [18].


5.3 Overall discussion


In certain combinations of platform and engine, an 
operation executes faster on Guarana than on the 
corresponding combination without it. This is quite 
hard to explain, since Guarana always executes at 
least as much code as Kaffe does. The tests have 
been verified so as to ensure that the results are 
correct, and the generation of the tables from the 
test results is mostly automated, so there is little 
room for human error. The better performance can
be attributed to factors such as improved fast-RAM 
cache hit ratio or alignment issues. 


The overhead introduced by interception on the in- 
terpreter engine is mostly small, because the inter- 
preter is usually orders of magnitude slower than 
the test for existence of a meta-object. The JIT, 
however, is severely affected by increased register 
pressure and additional register spilling and reload- 
ing. JIT-compilation costs have increased too, as 
our tests have shown, but they have only affected 
the figures of the compile test. In all other cases, 
we ensure that a method is JIT-compiled before we 
start timing its execution. 


Although the interception code has introduced mod- 
erate penalties for invoking static and private 
methods, the most common kind of invocation (non- 
final) causes a very small overhead, except on 
i686, and interface invocations are hardly affected
at all.


The bad results for some invocation bytecodes on
one x86 platform but not on the other are
unexpected, considering that both execute exactly
the same machine code. It looks like these tests
introduce pathological pipeline stalls or branch
prediction errors that degrade performance, since the
average penalty, measured in compile-diff, is very 
similar on both x86 platforms, and much lower than 
most of the individual penalties. 


On the other hand, the bad results for all load and
store operations on the JIT engines are expected,
since these bytecodes can usually be executed in
one or two machine-level instructions, and in
Guarana they require at least one more register and
two instructions to test for the presence of a
meta-object. Fortunately, in object-oriented
applications, field and array operations are
usually intertwined with method invocations and
object creations. Since the latter operations incur
a much smaller penalty, and they are one order of
magnitude slower than the former ones, the net
performance penalty may be acceptable, as the
introduction of reflective capabilities may pay off.


It is worth noting that, although we have introduced 
the ability to intercept object creation, we have not 
been able to measure the effect of this addition, 
due to the unpredictability of the garbage
collector. Anyway, the overhead is known to be
negligible, since a single test was introduced in a
rather complex function coded in C.


6 Future optimizations 


The reflection overhead on the interpreter is quite 
small. Furthermore, the interpreter is much slower 
than the JIT compiler, so there is not much point 
in trying to optimize it any further. For the JIT 
code, there is little hope for similarly small over- 
heads, though. 


One approach we had considered would be to implement
all operations, even field and array ones, as
invocations of dynamically generated JIT-compiled
code. Then, instead of having to test the
meta-object reference before performing an
operation, an extended dispatch table would contain
pointers to these JIT-generated functions, on
non-reflective objects, or to interceptor
functions, in the case of reflective objects.


However, we do not think this solution would do
very well. First, we would have to look up the
dispatch table before executing every single
operation, as in a virtual method invocation, and
the absolute time for a virtual method invocation
is much larger than that for a non-virtual method
invocation, so we would end up increasing the cost
of most operations, instead of reducing it.


Furthermore, invoking a function requires saving
most registers on some ABIs, but this is not
required when contents of memory addresses are
loaded directly, as field and array operations are
currently implemented. In fact, because of Kaffe's
inability to carry register allocation information
across basic blocks, the fact that Guarana
introduces basic blocks in field or array
operations forces registers to be stored in stack
slots because it might be necessary to invoke an
interceptor function. A promising optimization
involves improving the register allocation
mechanism so as to propagate register allocation
information along the most frequently used control
flow, that is, the one without interception, and
move the burden of spilling and reloading registers
into the not-so-common case in which interception
must take place. This would decrease the cost of
both branches, because they currently save all
registers and mark them all as unused before they
join to proceed to the next instruction.
Furthermore, if the JIT compiler ever gets smarter
with regard to global register allocation, the
additional branches introduced by Guarana will not
confuse it.


There is another optimization, much harder to
implement within Kaffe, that could reduce the
overhead of loops and methods that make heavy use
of a particular object or array. The test for the
existence of a meta-object could be performed
before entering the loop or starting the sequence,
and two versions of the code would be generated:
one in which no meta-object test is performed for
that object, and another in which the test is
performed in every iteration, because the
meta-object may change. This optimization is based
on a similar proposal for optimizing array
reference checking [15]. Unfortunately, this kind
of optimization can only be performed if neither
method invocation nor interception could possibly
occur within the loop or sequence, so as to ensure
that reconfiguration does not take place within
the same thread. Even in this case, other threads
might reconfigure the object or array while the
code runs, so synchronization operations must also
be ruled out because, by definition of the Java
Virtual Machine Specification [10], they flush any
local cache a thread might maintain. It may still
be worth the effort for array and field operations,
given that the overhead imposed on them is still
large.
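

The loop versioning idea described above can be
illustrated with a minimal sketch, written in Python
rather than as JIT-generated machine code. All names
(ReflectiveArray, sum_checked, sum_versioned) are
hypothetical and are not part of Guarana or Kaffe. The
sketch assumes a single thread and no reconfiguration
inside the loop, which is the restriction discussed
above.

```python
class ReflectiveArray:
    """Toy stand-in for an array that may have a meta-object attached."""

    def __init__(self, values, meta_object=None):
        self.values = list(values)
        self.meta_object = meta_object  # None means "not reflective"


def intercepted_read(array, index, log):
    # Slow path: record the interception, then perform the base-level read.
    log.append(index)
    return array.values[index]


def sum_checked(array, log):
    """Unoptimized version: tests for a meta-object on every iteration."""
    total = 0
    for i in range(len(array.values)):
        if array.meta_object is not None:
            total += intercepted_read(array, i, log)
        else:
            total += array.values[i]
    return total


def sum_versioned(array, log):
    """Versioned loop: test once, then run one of two loop bodies."""
    if array.meta_object is None:
        # Fast version: no per-iteration test. Only valid if nothing inside
        # the loop can attach a meta-object to the array.
        total = 0
        for value in array.values:
            total += value
        return total
    # Fallback version keeps the per-iteration test.
    return sum_checked(array, log)
```

Both versions compute the same result; the difference
is only in how often the meta-object test is executed
for a non-reflective array.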


7 Conclusions 


Our research on computational reflection was
initially motivated by our desire to evaluate the
use of MOPs as a tool for structuring and building
environments for fault-tolerant distributed
programming. We intended to design and implement a
library like MOLDS [19], a library of reusable and
combinable meta-level components useful for
distributed applications, such as persistence,
distribution, replication and atomicity.
Unfortunately, none of the existing reflective
architectures supported composition of meta-objects
in a way that fulfilled our needs. Therefore, we
started the development of Guarana. This paper is
an effort to convey the positive and negative
aspects of this experience.


Guarana provides a powerful and secure mech- 
anism to combine meta-objects into dynamically 
modifiable, elaborate meta-configurations. In ad- 
dition to enforcing a clear separation between the 
reflective levels of an application, the MOP of 
Guarana improves reuse of meta-level code by 
defining a meta-object interface that eases flexible 
composition. Furthermore, it suggests a separation
of concerns between meta-objects, which implement
meta-level behavior, and composers, which define
policies of composition and organization.


The implementation of the reflective architecture 
of Guarana required some modifications in a Java 
interpreter, but not in the Java programming lan- 
guage. Thus, any program created and compiled 
with any Java compiler will run on our implementa- 
tion, and it will be possible to use reflective mecha- 
nisms in order to extend them. 


Our modifications have reduced the speed of the in- 
terpreter, but we believe the flexibility introduced 
by the reflective capabilities outweighs this inconve- 
nience. Furthermore, the performance impact anal- 
ysis has revealed the current hot spots in the inter- 
ception mechanisms. We expect to reduce this im- 
pact by implementing the suggested optimizations. 


Now that we have Guarana, we are concentrat- 
ing our efforts on the design and implementation 
of MOLDS. The interaction of the various mecha- 
nisms foreseen for MOLDS will fully demonstrate 
the power of our MOP. Meanwhile, other projects 
based on Guarana are demonstrating its flexibility 
and ease of use. Tropyc [1] is a pattern language for 
the domain of cryptography, that is currently using 
Guarana in order to transparently introduce cryp- 
tographic mechanisms in electronic commerce appli- 
cations. The composition strategy of Guarana has 
also supported the implementation of the Reflective 
State Pattern and of its adaptation to the domain 
of fault tolerance [3, 4]. 


A final piece of evidence for the usefulness of our
approach is the possibility of creating a reflective
ORB by simply running a 100% Pure Java ORB in
Guarana. By doing this, we give the users of the
ORB the ability to create reflective middleware and
applications, with a development cost close to
zero.


The experience with the design and implementation 
of Guarana and related applications allows us to 
conclude that initiatives by the software industry to 
build software that is highly adaptable and reusable 
should incorporate MOPs as flexible as, and at least 
as efficient as the one we have described. 


A Obtaining Guarana 


Additional information about Guarana can be
obtained from the home page of Guarana, at the URL
http://www.dcc.unicamp.br/~oliva/guarana/. The
source code of its implementation atop the Kaffe
OpenVM, on-line documentation and full papers are
available for download.
Guarana is Free Software, released under the GNU 
General Public License, but its specifications are 
open, so non-free clean-room implementations are 
possible.


Abstract


Java’s object oriented nature along with its 
distributed nature make it a good choice for 
network computing. The use of virtual meth- 
ods associated with Java’s object oriented 
behavior requires accurate target prediction 
for indirect branches. This is critical to the 
performance of Java applications executed 
on deeply pipelined, wide issue processors. 
In this paper, we investigate the use of a 
path history based predictor to accurately 
determine the target of these virtual methods. 
The effect of varying the various parameters 
of the predictor on the misprediction rates 
is studied using various Java benchmarks. 
Results from this study show that the
execution of Java code will benefit from more
sophisticated branch predictors.


1 Introduction 


Java is a class-based object oriented lan- 
guage that is used extensively for building 
networked applications. Some of the features 
of Java such as the use of virtual methods, dy- 
namic loading and symbolic resolution that 
make it suitable for developing networked 
software applications also slow the execution 
speed of Java code. In this paper, we focus
on addressing the performance issues involved
with the use of virtual methods in Java.


The use of virtual methods as the default
method invocation mechanism results in the
frequent execution of indirect branches.
The invokevirtual JVM bytecode [1] that is
used to perform virtual method calls
constitutes 5% of the Java bytecodes executed,
on average, for the benchmarks shown in
Table 2. Many current JVM implementations,
such as Sun’s JDK interpreter and the CACAO
Just-in-Time Compiler [2], use a dispatch
table to implement the invokevirtual bytecode.
When a virtual method is invoked, the target
address is obtained from a fixed index into
the dispatch table of the current object.
Finally, an indirect branch instruction is
executed to jump to
the fetched target address. Thus, an indirect 
branch is executed for every virtual method 
invoked. The accurate prediction of these in- 
direct branches is critical to the performance 
of Java virtual machine (JVM) implementa- 
tions executing on deeply pipelined systems. 
Speculative execution is used in such archi- 
tectures to avoid the performance loss asso- 
ciated with the execution of branch instruc- 
tions. Accurate branch predictors are essen- 
tial to avoid discarding the results of the spec- 
ulative execution following a misprediction. 
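

The dispatch-table mechanism described above can be
sketched as follows. This is an illustrative model in
Python, not the code of any JVM; the slot numbers, the
table layout and the helper invokevirtual are
assumptions made for the example.

```python
# Hypothetical slot numbers, fixed when the classes are laid out.
SLOT_AREA = 0
SLOT_NAME = 1


def square_area(obj):
    return obj["side"] * obj["side"]


def square_name(obj):
    return "square"


def rect_area(obj):
    return obj["width"] * obj["height"]


def rect_name(obj):
    return "rectangle"


# One dispatch table per class; the same method occupies the same slot.
SQUARE_TABLE = [square_area, square_name]
RECT_TABLE = [rect_area, rect_name]


def invokevirtual(obj, slot):
    # Load the target from a fixed index into the receiver's dispatch table.
    target = obj["dispatch_table"][slot]
    # Calling through the loaded reference plays the role of the indirect
    # branch: the destination is unknown until the receiver is known.
    return target(obj)
```

The same call site, invokevirtual(obj, SLOT_AREA),
jumps to different code depending on the receiver,
which is why the branch target has to be predicted.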


Current processors employ a branch target
buffer (BTB) based mechanism to predict
indirect branches [12]. The misprediction
rates for virtual method calls using the
branch target buffer range up to 27% for the
studied benchmarks, as shown in Table 1.
Previous researchers have successfully used
path history information to improve the
prediction of direct branches [8, 16]. In this
paper, the path history of virtual method
calls is used to predict target addresses of
virtual method invocations. The path history
provides the capability to distinguish between
different dynamic executions of the same
virtual method. In the path history based
predictor, a hashing function of the path
history of target addresses and the virtual
method call site address is used to index a
target cache. The cached entry provides the
predicted target address of the virtual call.


5th USENIX Conference on Object-Oriented Technologies and Systems (COOTS '99)


The paper is organized as follows. Section 
2 discusses the background and motivation 
for this work. In Section 3, the path history
based target address predictor is introduced.
Next, the experimental strategy and benchmarks
used in this study are explained in Section 4.
The effect of the various parameters of the
path history based target address
predictor on the performance of the predic- 
tor is studied using the benchmarks in Sec- 
tion 5. Starting from a fully associative tar- 
get buffer of unlimited size, the parameters 
are optimized sequentially to account for the 
hardware constraints such as buffer size and 
limited associativity. The number of history 
buffers, the path history length, the number 
of target address bits, the hashing function, 
and the buffer structure were the parameters 
varied. Concluding remarks are provided in 
Section 6. 


2 Background 


The problem of target prediction for in- 
direct branches has been investigated for C 
and C++ programs. Calder and Grunwald 
proposed a 2-bit strategy for updating the 
branch target buffer (BTB) [18]. The target
address entry in the BTB is updated only
when two consecutive predictions made from
that entry are incorrect. This strategy, as
opposed to the default strategy of updating
the entry on each misprediction, was shown
to improve performance. Emer and Gloy
present several single-level predictors based 
on a combination of the values of program 
counter, stack pointer, register number and 
stack address [19]. They performed their 
study on SPECint95 programs. 
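

The difference between the default update strategy
and the 2-bit strategy can be sketched with a small
trace-driven model. This is a simplification under
stated assumptions: the table is an unbounded Python
dictionary keyed by call site rather than a fixed-size
BTB, and the function name simulate_btb and the trace
format are hypothetical.

```python
def simulate_btb(trace, two_miss_update=False):
    """Count mispredictions for a trace of (call_site, target) pairs.

    two_miss_update=False: replace the stored target on every misprediction.
    two_miss_update=True:  replace it only after two consecutive
                           mispredictions for the same entry.
    """
    table = {}  # call_site -> [stored_target, consecutive_misses]
    misses = 0
    for call_site, target in trace:
        entry = table.get(call_site)
        if entry is None:
            # Cold miss: allocate an entry for this call site.
            misses += 1
            table[call_site] = [target, 0]
            continue
        if entry[0] == target:
            entry[1] = 0  # a correct prediction resets the miss counter
            continue
        misses += 1
        entry[1] += 1
        if not two_miss_update or entry[1] >= 2:
            entry[0] = target
            entry[1] = 0
    return misses
```

For a call site whose target is usually A with an
occasional B (A A B A A B A A), the default strategy
mispredicts twice around each B, whereas the 2-bit
strategy mispredicts only once.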


Previous research has shown the use of 
correlation information from path history 
to predict the execution of direct branches 
[8, 16]. Recently, the path history infor-
mation has been used to predict indirect
branches [17, 20]. Chang, Hao and Patt pro-
posed a target cache that uses the branch
history to distinguish different dynamic oc- 
currences of each indirect branch [17]. Their 
study was performed on select SPECint95 
programs. Their work also shows the cor- 
relation between higher misprediction rates 
and slower execution speed. In this paper, 
we make use of this observation and focus 
on reducing misprediction rates.
Object-oriented programs in C++ and Java use
indirect branches much more frequently than
SPECint95 programs do. Target address
prediction for indirect branches using a suite
of C++ programs and SPECint95 was performed
in [20]. That study investigated the
impact of various hardware constraints on the 
performance of a path history based predic- 
tor. Our work uses a similar approach in 
investigating the indirect branch behavior of 
Java programs. Unlike the previous efforts, 
the focus of this work is confined to the tar- 
get prediction of indirect branches that occur 
due to the virtual method invocations. The 
best way to improve the performance of vir- 
tual method invocations is to eliminate the 
virtual calls by inlining or statically binding 
them [14]. However, only a portion of the 
calls can be safely bound statically [15]. We 
identify Java code characteristics that enable 
the use of path based predictors in identifying 
the target of the virtual calls. 


Since virtual method invocation has been
identified as one of the major bottlenecks for
the performance of Java code [11, 13], the
impact of the various parameters of the path
history predictor on prediction accuracy is
investigated in this work. It was observed in
[11] that the proportion of virtual methods is
likely to increase due to the trend towards
fine-grained object design in Java
applications. In such an environment, big
objects become many smaller objects.
Consequently, big methods become many smaller
methods. This causes many more method
invocations, and method invocation
increasingly becomes a performance bottleneck.
In [13], various Java benchmarks were profiled
and virtual methods were identified as one of
the bottlenecks in Java execution. That work
also investigated the receiver type locality
at virtual method call sites.


Table 1: Misprediction rates using normal
and 2-bit replacement strategies
A 32K direct mapped BTB was used.


Java performance studies have been performed
in [9] and [10] to investigate the need for
architectural support for Java execution. In
[10], it was concluded from a study of Java
interpreters that it may be premature to
provide hardware support for Java execution.
The results of this paper indicate that
micro-architectural resources such as branch
predictors can be enhanced to support Java
execution. However, it must be noted that the
support proposed here is based on the Java
language characteristics rather than just the
interpreter characteristics analyzed in [10].


3 Path History Based Predictor


In this section, the use of path history to
improve the target prediction accuracy is
investigated. Path history consists of target
addresses of recently executed branches. The
history of target addresses provides useful
correlation information that can be used to
improve the branch prediction accuracy.
Virtual method calls in Java programs exhibit
correlation among the receiver types at call
sites. This is due to the presence of
correlation among consecutive call sites as
shown in Figure 1. In this example, the shape
object s invokes a series of virtual calls and
the different call sites in the function
drag_drop have the same receiver type. Another
reason for the correlation is a sequence of
virtual calls triggered by a single virtual
method call, together with the presence of
looping constructs, as shown in Figure 2.
Here, the invocation of a virtual method to
print a string triggers a sequence of method
invocations. The use of path history in
exploiting such correlation among call sites
to predict the destination of virtual calls is
investigated in this paper.


// This is a drag drop code in GUI based programs
class mouse {
    public void drag_drop(shape s) {
        s.invalidate_object_area();
        screen.invalidate();
        s.move(new_location);
        s.update_object_area();
        s.repaint();
        screen.update();
    }
}

Fig 1: Correlation in receiver types
among call sites


System.out.println("Hello");
    |
java.io.PrintStream.println(..)
repeat for length_of("Hello") times
java.io.PrintStream.print(..)
java.lang.String.charAt(..)
java.io.PrintStream.write(..)
java.io.BufferedOutputStream.write(..)
java.io.PrintStream.write(..)
java.io.BufferedOutputStream.write(..)
java.io.BufferedOutputStream.flush(..)

Fig 2: A single virtual method invoking
a series of methods


Figure 3 shows the predictor based on the
use of path history information. The program
counter stores the address of the virtual
method call site. An indexing function of the
program counter is used to access the path
history information corresponding to the call
site. Then a hashing function of the path
history information from the history buffers
and the program counter is used to form the
hashing address. This address is used to index
the target buffer to obtain the target
address. The various parameters involved in
the design of such a predictor include the
number of history buffers (n), the path
history length (p), the number of bits of each
target address registered in the history
buffer (b), the hashing function, and the
structure of the target buffer. The target
buffer could be either tagless or tagged. In a
tagless buffer, the hashing address is used to
index into the buffer and no tag comparisons
are involved. Hence, two different hashing
addresses that map to the same target buffer
location are not distinguished. In contrast,
the tagged buffer has a tag associated with
each entry. These tags help to distinguish
different history patterns that map to the
same location. The influence of these
parameters on the performance of the predictor
is studied in the subsequent sections.
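

The two-level lookup described above can be sketched
in a few lines. This is a minimal model, not the
simulator used in the study: it assumes a single
global history buffer (n = 1), keeps whole target
values in the history, and uses an unbounded Python
dictionary keyed by (call site, history) in place of a
hashing function and a fixed-size target buffer. The
class and function names are hypothetical.

```python
from collections import deque


class PathHistoryPredictor:
    """Global path history predictor with an unlimited target buffer."""

    def __init__(self, path_length):
        # The history holds the last `path_length` targets; with a path
        # length of 0 it stays empty and the predictor behaves like a BTB.
        self.history = deque(maxlen=path_length)
        self.target_buffer = {}

    def _hashing_address(self, call_site):
        # Stand-in for the hashing function: call site plus path history.
        return (call_site, tuple(self.history))

    def predict(self, call_site):
        return self.target_buffer.get(self._hashing_address(call_site))

    def update(self, call_site, actual_target):
        # Record the outcome under the history seen *before* this call,
        # then shift the actual target into the path history.
        self.target_buffer[self._hashing_address(call_site)] = actual_target
        self.history.append(actual_target)


def count_mispredictions(trace, path_length):
    predictor = PathHistoryPredictor(path_length)
    misses = 0
    for call_site, target in trace:
        if predictor.predict(call_site) != target:
            misses += 1
        predictor.update(call_site, target)
    return misses
```

On a trace in which two call sites alternate between
receiver types in lockstep, a path length of 1 lets
each call be predicted from the previous target, while
a path length of 0 mispredicts every call.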


Table 2: Description of Benchmarks

Benchmark   Description                T1
                                       215K
                                       DAK
richards    O.S task dispatcher [4]    1517K
deltablue   Constraint solver [4]      12082K
heap        Garbage collector (3

T1 - Number of virtual calls executed


[Figure 3 block labels: Program Counter, Indexing
Function, History Buffers, Hashing Function,
Hashing Address, Target Buffer]

Fig 3: Path history based two level indirect
branch prediction


4 Experimental Setup 


The following experimental strategy was
used in this study. The traces of indirect
branches corresponding to the virtual method
call sites were obtained through modifications
to the JDK 1.0.2 source code. The benchmarks
shown in Table 2 were executed using the
modified JDK 1.0.2 on a Sparc-20 processor
under the Solaris 2.5 operating system. The
javac and javadoc benchmarks are large
applications with 25,400 and 28,471 lines of
code respectively. The richards and deltablue
benchmarks are medium size benchmarks with 410
and 984 lines of code [4]. These two
benchmarks were chosen as they have been used
in earlier studies of polymorphic behavior of
object oriented languages [5]. The heap
benchmark is a 4495 line applet that
implements incremental garbage collection.

5 Predictor Parameter Varia- 
tions 


In this section, the effect of varying the pa- 
rameters involved in the design of the branch 
predictor is studied. The parameters were op- 
timized sequentially and the following subsec- 
tions report them in that order. The stud- 
ied parameters include the number of history 
buffers (n), path history length (p), the num- 
ber of bits of each target address registered in 
the history buffer (b), the hashing function, 
and the structure of the target buffer. 


5.1 Number of History Buffers 


The number of history buffers determines
the number of virtual method call sites that
share their history. When n = 1, all the
virtual call sites share the same history
buffer and the resulting predictor is called a
global history predictor. In contrast, a
per-address history predictor keeps a separate
history for each virtual method call site.
This is achieved when n = 2^w, where w is the
word size. When the number of history buffers
is between 1 and 2^w, a set of addresses uses
the same history buffer. The effect of the
number of history buffers on misprediction
rate was studied by using the most significant
bits of the program counter to access the
history buffers. To mask the effects of other
parameters of the predictor, a fully
associative target buffer of unlimited size
was used. Further, all the bits of the target
address were registered in the path history
buffers and the hashing address was formed by
concatenating the selected path history buffer
with the program counter.


Figures 4 and 5 show the results of this 
investigation for javac and richards bench- 
marks respectively. The h most significant 
bits of the program counter select the his- 
tory buffer corresponding to the call site. The 
global history predictor is simulated when 
h = 0. In contrast, h = w corresponds to 
the per-address history predictor. It is
observed from the figures that the global path
history predictor performs better than the
per-address and per-set history predictors.
For example, the misprediction rates
increase from 2.6% for global path history to 
4.2% for the per-address scheme using javac 
with a path length of 2. This indicates that 
the correlation across call sites is more use- 
ful than the self history at a call site in pre- 
dicting the targets. This can be ascribed to 
the execution of a series of virtual calls corre- 
sponding to the invocation of a single virtual 
call as shown earlier. A global path history 
can capture the effect of such constructs bet- 
ter than a per-address scheme. Thus, a global 
path history is used in refining the other pa- 
rameters of the predictor. 
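

The selection of a history buffer by the h most
significant bits of the program counter can be written
down directly. The sketch below assumes a word size of
w = 32 bits purely for illustration; the function
names are hypothetical.

```python
WORD_SIZE = 32  # assumed word size w, chosen only for this example


def history_buffer_index(program_counter, h, w=WORD_SIZE):
    """Select a history buffer using the h most significant bits of the PC.

    h = 0 gives a single global history buffer (index 0 for every call
    site); h = w gives one history buffer per call-site address.
    """
    if not 0 <= h <= w:
        raise ValueError("h must lie between 0 and w")
    pc = program_counter & ((1 << w) - 1)
    return pc >> (w - h)


def number_of_history_buffers(h):
    # Using h bits of the program counter addresses n = 2^h buffers.
    return 1 << h
```

With h = 0 every call site maps to buffer 0, so all
call sites share one history; intermediate values of h
give the per-set schemes, in which nearby call sites
share a buffer.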


5.2 History Path Length 


The number of target addresses of the virtual
methods registered in each history buffer is
called the history path length, p. When
p = 0, the two-level path history predictor be- 
comes a single level predictor similar to the 
BTB strategy. The variation in path length 
can help in determining whether the corre- 
lation among the target addresses of virtual 
methods is long-term or short-term. The ef- 
fect of path length variation was studied with 
a fully associative target buffer of unlimited 
size along with a global path history. Fig- 
ure 6 shows the results of this study for the 
different benchmarks. 


Fig 4: Variation in misprediction rate
with number of history buffer sets for
javac. A fully associative target buffer
of unlimited size was used.
(Axes: misprediction rate (%) versus number
of history bits.)


Fig 5: Variation in misprediction rate
with number of history buffer sets for
richards. A fully associative target
buffer of unlimited size was used.
(Axes: misprediction rate (%) versus number
of history bits.)


The path length affects the misprediction
rate in two ways. Firstly, misses occur when 
the path length is too small to capture a long- 
term dependence. Secondly, longer paths 
take a longer time to adapt to branch be- 
havior changes and this results in start-up 
misses. Thus, a longer path would capture 
more long term dependence but would have 
more start-up misses. In contrast, a shorter 
path fails to capture the long-term regulari- 
ties in method invocation targets but adapts 
quickly to changes in branch behavior. 


The javac benchmark reflects this trade-off
clearly. The misprediction rate falls from
4.6% when the path length is zero to 2.6%
when the path length is two. In this phase,
the effect of capturing more regularities
dominates the effect
of start-up misses. However, the mispredic- 
tion rates increase when the path length is in- 
creased beyond two, specifically from 2.6% to 
5.1% when path length is increased from two 
to seven. The start-up misses begin to dom- 
inate any improvement obtained by captur- 
ing virtual method history dependence longer 
than two. This indicates that most path his- 
tory patterns used in javac have a relatively 
short period. The javadoc benchmark also 
exhibits a similar behavior. 


The heap benchmark does not benefit from 
the path history information. It is observed 
that the branch target buffer scheme per- 
forms better than the predictor with the path 
history. This is due to the relatively con- 
stant target addresses at the call sites in the 
heap benchmark. Hence, the path history in- 
formation only adds to the start-up misses 
and does not benefit from capturing any ad- 
ditional regularities. In contrast, the richards 
and deltablue benchmarks benefit from long 
path lengths. The misprediction rates keep 
decreasing as path lengths increase from 0 to 
8. This shows that these benchmarks have a 
long-term correlation that enables the over- 
shadowing of start-up misses associated with 
longer paths. These results indicate that the 
optimum values for the path length differ 
based on the benchmark.


Fig 6: Variation in misprediction rate
with path length using global history.
A fully associative target buffer of
unlimited size was used.
(Vertical axis: misprediction rate (%).)


5.3 Path History Compression


The global history pattern along with the
branch address stored in the program counter
is used to index the target buffer. When all
the bits of the history buffer and the program
counter are used, the resulting bit pattern is
long: its length is (p + 1) * w bits. The
number of different path history patterns that
a hashing address of this length can
distinguish is 2^((p + 1) * w). However, most
programs do not have that many patterns. Thus,
the effect on the misprediction rates of
varying the number of bits stored per target
address in the history buffer was
investigated. Table 3 shows the results of
this investigation for the benchmarks. The
least significant bits of the target addresses
were used in the history patterns. It is
observed that the least significant bits
capture more information than the more
significant bits. For javac, javadoc and
deltablue the misprediction rates decrease
when b is increased from 2 to 8 and do not
change when b is increased further. This study
shows that registering only the least
significant bits of the target address in the
history buffer could reduce the bit width of
the hashing address without much loss in
performance.
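

The saving from compression can be made concrete with
a small calculation. The uncompressed width (p + 1) * w
is taken from the text; the compressed width
p * b + w is an assumption derived from the
description (p history entries of b bits each,
concatenated with a w-bit program counter). The
function name is hypothetical.

```python
def hashing_address_bits(p, w, b=None):
    """Width in bits of the concatenated hashing address.

    p: path length (number of targets kept in the history buffer)
    w: word size in bits (width of the program counter and of a target)
    b: bits kept per target address; None means all w bits are kept
    """
    bits_per_target = w if b is None else b
    return p * bits_per_target + w
```

For a path length of two and a 32-bit word, keeping
all target bits gives a 96-bit hashing address, while
keeping only 8 bits per target gives 48 bits.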


Table 3: Effect of history bit compression on
misprediction rates

Pa 37_[ 24 [356 | 13 __|
ref 3s [20 [60 | 08 _|
[aa] 26 | 12 | 40 | 96 |

b is the number of bits from target address used in
the path history information.
A path length of two and a fully associative target
buffer of unlimited size was used.


5.4 Hashing Function 


The effect of the hashing function on the 
misprediction rates was investigated using 
limited size tagless target buffers. The hash- 
ing function needs to utilize both the path 
history and the program counter (call site) 
information effectively. The simplest hashing 
function is the concatenation scheme shown 
in Figure 7. Here, h bits of path history
information and the program counter are
concatenated to form the least significant
and most significant bits of the hashing
address respectively. Then, the s least
significant bits of the hashing address are
used to index the target buffer of size 2^s.
The contents of the indexed entry provide
the predicted target address.
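

The concatenation scheme just described amounts to a
shift, an OR and a mask. The sketch below follows that
description; the function name and the argument
layout are assumptions made for the example.

```python
def concat_index(program_counter, history_bits, h, s):
    """Index into a tagless target buffer with 2**s entries.

    The h low-order bits of the path history form the least significant
    bits of the hashing address and the program counter forms the most
    significant bits. Only the s least significant bits are kept.
    """
    history_part = history_bits & ((1 << h) - 1)
    hashing_address = (program_counter << h) | history_part
    return hashing_address & ((1 << s) - 1)
```

Because the index has a fixed width of s bits, every
extra bit of history pushes one bit of the program
counter out of the index. With h = 0 the index is
formed from the program counter alone, and with h = s
it is formed from the path history alone.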


The bit width of the path history buffer, h,
which constitutes the least significant bits
used to index the target buffer, was varied
and its effect on the misprediction rate was
investigated. This was performed to study
the relative importance of the path history
and program counter information. Figure 8
shows the results for an 8K entry target
buffer using javac for different values of b.
It is observed that the misprediction rate
decreases when h increases from 0 to 6. When
h is increased further, the misprediction
rates increase. Since the 8K target buffer is
indexed using a fixed size 13-bit index, the
number of bits from the program counter used
in the index reduces as the value of h
increases. This indicates that the call site
location is relatively more important than
the path history information. This behavior
arises because the target address is primarily
determined by the call site location, with the
path history providing only additional
information for the prediction. Table 4 shows
the relative importance of the path history
and call site location. It is observed that
using only the 13-bit call site location to
index the 8K target buffer, a misprediction
rate of 4.9% is achieved. The misprediction
rate increases to 12.9% when 12 bits of path
history and 1 bit of the call site location
are used for javac.


[Figure 7 block labels: Program Counter, History
Buffer, Hashing Address]

Fig 7: Tagless concatenation scheme
with global path history


Fig 8: Variation in misprediction rate
with concatenated length using tagless
concatenation scheme with global path
history. An 8K target buffer was used.
b refers to the number of bits per address
recorded in the history buffer.
(Axes: misprediction rate (%) versus number
of bits used from path history, 0 to 12.)


An entry y in column S refers to the concatenated
index formed with y bits of path history and
remaining bits from program counter. All schemes
register 4 bits of target address in history buffer.


In order to utilize both the program
counter and the path history bits more
effectively for a fixed size target buffer,
the XOR hashing scheme shown in Figure 9 was
investigated. The XOR hashing function helps
in combining more information from the program
counter and the path history bits as compared
to the concatenation scheme. Here, a bitwise
XOR of the program counter and the path
history buffer bits is performed to obtain the
hashing address. Then, the least significant
bits of the hashing address are used to index
the target buffer. Two replacement strategies
were studied to update the target buffers
using the global XOR scheme. These strategies
are the same as those studied to update the
BTB. It was observed that the 2-bit strategy
performs better for the XOR scheme for most of
the benchmarks. Figure 10 shows the results
for the javac benchmark. The XOR scheme with
the 2-bit strategy is referred to simply as
the XOR scheme in the rest of the paper. The
effectiveness of the XOR scheme as compared to
the concatenation scheme is shown in Table 4.
It is observed that for an 8K target buffer,
the minimum misprediction rate using the
concatenation scheme is 4.1% compared to 3.6%
using the XOR scheme for javac.
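

The XOR scheme can be sketched in the same style. The
packing of the history buffer (oldest target first,
b low-order bits per target) is an assumption made
for illustration, as are the function names.

```python
def pack_history(targets, b):
    """Pack the b least significant bits of each target into one integer.

    `targets` lists the recorded target addresses, oldest first.
    """
    bits = 0
    for target in targets:
        bits = (bits << b) | (target & ((1 << b) - 1))
    return bits


def xor_index(program_counter, history_bits, s):
    """Index into a tagless target buffer with 2**s entries.

    The program counter and the packed path history are XORed bitwise and
    the s least significant bits of the result are used as the index.
    """
    hashing_address = program_counter ^ history_bits
    return hashing_address & ((1 << s) - 1)
```

Unlike concatenation, every bit of the index can
depend on both the call site and the path history, so
adding history bits does not displace bits of the
program counter.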


[Figure 9 block labels: Program Counter, Method
Address, Hashing Address, Target Buffer]

Fig 9: Tagless XOR scheme with global
path history


Fig 10: Comparison of normal and 2-bit
update schemes using global tagless XOR.
All configurations use 4 bits of method
address in history buffer.


Next, the effect of path length on the
misprediction rate of the XOR scheme was
investigated. In order to vary the path
length, the number of bits b written from each
target address into the history buffer was
varied. If s bits are required to index the
tagless target buffer and p is the path
length, b was chosen such that b * p < s.
Figures 11 and 12 show the results of this
study for the javac and richards benchmarks
respectively. The misprediction rate for javac
exhibits a trend similar to that observed with
the fully associative target buffer of
unlimited size: it achieves its minimum for a
path length of two. For the richards benchmark
the misprediction rates are lowest for a path
length of two when target buffer sizes are
small (0.5K to 2K). When the target buffer
size is increased (4K to 32K), the minimum
misprediction rate is achieved for a path
length of three. Thus, a longer path length
improves the misprediction rate as the target
buffer size increases. This is due to the
greater number of bits of each target address
constituting the index portion of the target
buffer for a given path length.


Fig 11: Variation in misprediction rate
with path length for javac for different
target buffer sizes. Uses tagless XOR
scheme with global path history


Fig 12: Variation in misprediction rate
with path length for richards for dif-
ferent target buffer sizes. Uses tagless
XOR scheme with global path history


5.5 Tagged versus Tagless Target
Buffers


We also studied the impact of interference
due to the presence and absence of tags with
the target buffers. The XOR hashing scheme
was utilized in studying both the approaches.
In the tagless scheme, the target address of
the indirect branch is selected using the
hashing address to index the target buffer. Since
no tags are associated with the target buffer
entries, more than one hashing address can map
to the same location. Due to this interference,
the target of the indirect branch may be
selected based on the outcome of some other
branch path pattern. Positive interference
occurs when two different patterns that map
to the same target buffer location have the
same target address. Conversely, when the
interference results in more misses it is called
negative interference. A tagged target buffer
can be used to eliminate the effects of
negative interference.
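

The difference between the two lookups can be sketched in C as follows. This is a minimal, direct-mapped illustration, not the simulated hardware: the four-entry buffer size, the type names, and the split of the hashing address into index (`hash % ENTRIES`) and tag (`hash / ENTRIES`) are assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define ENTRIES 4u   /* tiny buffer so that aliasing is easy to provoke */

typedef struct { uint32_t target; } tagless_entry_t;
typedef struct { bool valid; uint32_t tag; uint32_t target; } tagged_entry_t;

/* Tagless: record the target at the indexed entry, no identity kept. */
static void tagless_update(tagless_entry_t *buf, uint32_t hash, uint32_t target) {
    buf[hash % ENTRIES].target = target;
}

/* Tagless: whatever sits at the index is the prediction, even if it
 * was written by a different path pattern (interference). */
static uint32_t tagless_predict(const tagless_entry_t *buf, uint32_t hash) {
    return buf[hash % ENTRIES].target;
}

/* Tagged, direct mapped: remember which hashing address owns the entry. */
static void tagged_update(tagged_entry_t *buf, uint32_t hash, uint32_t target) {
    tagged_entry_t *e = &buf[hash % ENTRIES];
    e->valid = true;
    e->tag = hash / ENTRIES;
    e->target = target;
}

/* Tagged: predict only when the stored tag matches the hashing address. */
static bool tagged_predict(const tagged_entry_t *buf, uint32_t hash, uint32_t *target) {
    const tagged_entry_t *e = &buf[hash % ENTRIES];
    if (e->valid && e->tag == hash / ENTRIES) {
        *target = e->target;
        return true;
    }
    return false;   /* miss: an aliasing pattern gives no prediction */
}
```

Hashing addresses 1 and 5 alias in this buffer. After an update for address 1, the tagless lookup for address 5 returns the other pattern's target, which is positive interference if the real target happens to be the same and negative interference otherwise. The tagged lookup for address 5 reports a miss in both cases.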


The impact of the tagged and tagless target 
buffers was studied for various buffer sizes by 
varying the associativity and the path length. 
Figures 13 and 14 show the variation in misprediction
rate using target buffer sizes of 2K
and 4K for the tagged and tagless buffers re- 
spectively for javac and richards benchmarks. 
Additional entries were provided for the tag- 
less case to account for the area overhead in 
maintaining tags. In these plots, an increase 
in the number of target address bits in his- 
tory buffer corresponds to a decrease in path 
length. It is observed that the misprediction 
rates decrease with increase in associativity 
for the tagged buffers. Also, it is observed 
that there is not a significant improvement 
when associativity is increased beyond 8. For 
javac, it is observed that the tagless target 
buffer performs better than the tagged
buffers when a = 1 and a = 2 for all path 
lengths. It must be noted that the tagged 
buffers are useful only when they are able to 
register the alternate target address when a 
conflicting path history is identified. Thus, 
direct mapped tagged buffers perform inher- 
ently worse than the tagless buffers as they 
also do not benefit from positive interference. 
Hence, higher associativities are required in 
the tagged caches to benefit from the absence 
of negative interference. 


When path lengths become large (number 
of target bits in history buffer becomes small), 
the number of different patterns generated 
corresponding to an indirect branch increases. 
Hence, a tagless buffer can benefit from the 
positive interference between these different
patterns. Thus, it is observed that the tag- 
less buffer performs better than the 4 and 8- 
way associative tagged buffers for longer path 
lengths for javac. For the richards benchmark
the effect of positive interference is smaller,
since it benefits from longer distinguishing
patterns as was observed in Figure 6. Thus,
the tagless target buffer performs better than
only a direct mapped tagged buffer for all
path lengths for the richards benchmark. It
is also observed that the tagged buffers
of higher associativity provide a greater
improvement in prediction rates for smaller path
lengths in both the benchmarks. This indicates
that the conflict misses due to short-term
variations in targets are being reduced by
the tagging mechanism.


[Figure 13 plot. Y axis: Misprediction rate (%); X axis: Number of target address bits in history buffer]


Fig 13: Comparison of tagged and tag-
less target buffers using javac. The sizes
of the tagged and tagless target buffers
were 2K and 4K respectively. The XOR
hashing scheme with global path his-
tory was used for all cases.


Fig 14: Comparison of tagged and tag-
less target buffers using richards. The
sizes of the tagged and tagless target
buffers were 2K and 4K respectively.
The XOR hashing scheme with global
path history was used for all cases.


Figures 15 and 16 show the variation in
misprediction rates for the various buffer
sizes. A tagged target buffer requires
additional area for maintaining the tags
as compared to a tagless target buffer. Hence,
a tagless target buffer can have more
entries for the same implementation cost.
It can be observed that an associative
tagged target buffer with 8 or more
entries per set outperforms the tagless buffer.
It can also be observed that the increase in
buffer size reduces the conflict misses
significantly for the tagless and tagged buffers with
associativity less than or equal to 4. The
tagged buffers with associativity greater than
4 do not benefit much from an increase in buffer
size since the higher number of entries per
set already takes care of most of the
conflict misses. Due to the increase in access
and cycle times associated with the higher
associativities [21], the area overhead of the
tags in the tagged buffers and the small
difference in the misprediction rates between
tagless and tagged buffers with large
associativities (a > 4), the tagless target buffer may be
a better choice in many cases.


Fig 15: Comparison of tagged and tag-
less target buffers using javac with vari-
ation in target buffer size. A path
length of 2 was used.


[Figure 16 plot. Y axis: Misprediction rate (%); X axis: Cache Size (words), 0 to 5000]


Fig 16: Comparison of tagged and tag-
less target buffers using richards with
variation in target buffer size. A path
length of 4 was used.


6 Conclusion


The effectiveness of using path history
to predict the target addresses of indirect
branches due to virtual method invocations
used in Java applications was investigated.
The influence of the various parameters such
as number of history buffers, path length,
hashing function and the structure of the
target buffers on the misprediction rates was
investigated. The XOR hashing scheme with a
global path history and a 2-bit update
policy performed the best for almost all
configurations. Also, it was found that the
tagless target buffers achieve a prediction rate as
good as the tagged buffers without suffering
from the area overhead for tags and the
increased access times associated with the
associative buffers. Using the branch target buffer
based predictor with an 8K buffer,
misprediction rates of 4.9% and 23.4% were obtained
for the javac and richards benchmarks
respectively. The misprediction rates drop to
3.6% and 2.4% for the two benchmarks using
the proposed path history based predictor.
The results show that the design of
micro-architectural features such as the branch
predictor will influence the execution speed of
Java code.


Abstract


Existing profilers for Java applications typically rely on 
custom instrumentation in the Java virtual machine, and 
measure only limited types of resource consumption. 
Garbage collection and multi-threading pose additional 
challenges to profiler design and implementation. 


In this paper we discuss a general-purpose, portable, and 
extensible approach for obtaining comprehensive pro- 
filing information from the Java virtual machine. Pro- 
filers based on this framework can uncover CPU us- 
age hot spots, heavy memory allocation sites, unnec- 
essary object retention, contended monitors, and thread 
deadlocks. In addition, we discuss a novel algorithm 
for thread-aware statistical CPU time profiling, a heap 
profiling technique independent of the garbage collec- 
tion implementation, and support for interactive profil- 
ing with minimum overhead. 


1 Introduction 


Profiling [14] is an important step in software develop- 
ment. We use the term profiling to mean, in a broad 
sense, the ability to monitor and trace events that oc- 
cur during run time, the ability to track the cost of these 
events, as well as the ability to attribute the cost of the 
events to specific parts of the program. For example, a 
profiler may provide information about what portion of 
the program consumes the most amount of CPU time, 
or about what portion of the program allocates the most 
amount of memory. 


This paper is mainly concerned with profilers that
provide information to programmers, as opposed to
profilers that feed back to the compiler or run-time system.
Although the fundamental principles of profiling are the
same, there are different requirements in designing these
two kinds of profilers. For example, a profiler that sends
feedback to the run-time system must incur as little
overhead as possible so that it does not slow down program
execution. A profiler that constructs the complete call
graph, on the other hand, may be permitted to slow down
the program execution significantly.


This paper discusses techniques for profiling support in 
the Java virtual machine [17]. Java applications are writ- 
ten in the Java programming language [10], and com- 
piled into machine-independent binary class files, which 
can then be executed on any compatible implementation 
of the Java virtual machine. The Java virtual machine is 
a multi-threaded and garbage-collected execution envi- 
ronment that generates various events of interest for the 
profiler. For example: 


• The profiler may measure the amount of CPU time
consumed by a given method in a given class. In
order to pinpoint the exact cause of inefficiency, the
profiler may need to isolate the total CPU time of a
method A.f called from another method B.g, and
ignore all other calls to A.f. Similarly, the profiler
may only want to measure the cost of executing a
method in a particular thread.


• The profiler may inform the programmer why there
is excessive creation of object instances that
belong to a given class. The programmer may want
to know, for example, that many instances of class
D are allocated in method C.h. More specifically,
it is also useful to know that the majority of these
allocations occur when B.g calls C.h, and only when
A.f calls B.g.


• The profiler may show why a certain object is not
being garbage collected. The programmer may
want to know, for example, that an instance of class
C is not garbage collected because it is referred to
by an instance of class D, which is then referred
to by a local variable in an active stack frame of
method B.g.


• The profiler may identify the monitors that are
contended by multiple threads. It is useful to know, for
example, that two threads, T1 and T2, repeatedly
contend to enter the monitor associated with an
instance of class C.


• The profiler may inform the programmer what
causes a given class to be loaded. Class loading
not only takes time, but also consumes memory
resources in the Java virtual machine. Knowing the
exact reason that a class is loaded, the programmer
can optimize the code to reduce memory usage.


The first contribution of this paper is to present a 
general-purpose, extensible, and portable Java virtual 
machine profiling architecture. Existing profilers typ- 
ically rely on custom instrumentation in the Java vir- 
tual machine and measure limited types of resource con- 
sumption. In contrast, our framework relies on an in- 
terface that provides comprehensive support for profilers 
that can be built independent of the Java virtual machine. 
A profiler can obtain information about CPU usage hot 
spots, heavy memory allocation sites, unnecessary ob- 
ject retention, monitor contention, and thread deadlocks. 
Both code instrumentation and statistical sampling are 
supported. Adding new features typically requires intro- 
ducing new event types, and does not require changes 
to the profiling interface itself. The profiling interface is 
portable. It is not dependent on the internal implementa- 
tion of the Java virtual machine. For example, the heap 
profiling support is independent of the garbage collec- 
tion implementation, and can present useful information 
for a wide range of garbage collection algorithms. The 
benefit of this approach is obvious. Tools vendors can 
ship profilers that work with any virtual machine that 
implements the interface. Equivalently, users of a Java 
virtual machine can easily take advantage of the profilers 
available from different tools vendors. 


The second contribution of this paper is to introduce 
an algorithm that obtains accurate CPU-time profiles in 
a multi-threaded execution environment with minimum 
overhead. It is a standard technique to perform statisti- 
cal CPU time profiling by periodically sampling the run- 
ning program. What is less known, however, is how to 
obtain accurate per-thread CPU time usage on the ma- 
jority of operating systems that do not provide access 
to the thread scheduler or a high-resolution per-thread 
CPU timer clock. In these cases, it is difficult to at- 
tribute elapsed time to threads that are actually running, 
as opposed to threads that are blocked, for example, in 
an I/O operation. Our solution is to determine whether 
a thread has run in a sampling interval by comparing the
checksum of its register set. To our knowledge, this is
the most portable technique for obtaining thread-aware
CPU-time profiles on modern operating systems.


The third contribution is to demonstrate how our ap- 
proach supports interactive profiling with minimum 
overhead. Users can selectively enable or disable dif- 
ferent types of profiling while the application is running. 
This is achieved with very low space and time overhead. 
Neither the virtual machine nor the profiler needs to
accumulate large amounts of trace data. The Java virtual
machine incurs only a test and branch overhead for a dis- 
abled profiling event. Most events occur in code paths 
that can tolerate the overhead of an added check. As 
a result, the Java virtual machine can be deployed with 
profiling support in place. 
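

The "test and branch" cost of a disabled event can be pictured with the following C sketch. The event names, the bit mask, and the functions `enable_event`, `disable_event`, `vm_post_event`, and `agent_notify` are hypothetical stand-ins chosen for this illustration; they are not the actual JVMPI entry points.

```c
#include <stdint.h>

/* Hypothetical event types; the real interface defines its own set. */
enum {
    EVENT_METHOD_ENTRY = 0,
    EVENT_OBJECT_ALLOC = 1,
    EVENT_GC_FINISH    = 2,
    EVENT_COUNT
};

static uint32_t enabled_mask;            /* one bit per event type */
static unsigned delivered[EVENT_COUNT];  /* events that reached the agent */

/* The agent toggles event types while the application is running. */
static void enable_event(int type)  { enabled_mask |=  (1u << type); }
static void disable_event(int type) { enabled_mask &= ~(1u << type); }

/* Stand-in for the function call into the profiler agent. */
static void agent_notify(int type) { delivered[type]++; }

/* Placed in the virtual machine at each point where an event can occur.
 * When the event is disabled, the only cost is this test and branch. */
static void vm_post_event(int type) {
    if (enabled_mask & (1u << type)) {
        agent_notify(type);
    }
}
```

Because a disabled event never reaches `agent_notify`, no trace data is accumulated for it, which is consistent with deploying the virtual machine with profiling support compiled in.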


We have implemented all the techniques discussed in 
this paper in the Java Development Kit (JDK) 1.2 
[15]. Numerous tool vendors have already built profil- 
ing front-ends that rely on the comprehensive profiling 
support built into the JDK 1.2 virtual machine. 


We will begin by introducing the general-purpose profil- 
ing architecture, before we discuss the underlying tech- 
niques in detail. We assume the reader is familiar with 
the basic concepts in the Java programming language 
[10] and the Java virtual machine [17].


2 Profiling Architecture 


The key component of our profiling architecture is a 
general-purpose profiling interface between the Java vir- 
tual machine and the front-end responsible for present- 
ing the profiling information. A profiling interface, as 
opposed to direct profiling support in the virtual machine 
implementation, offers two main advantages: 


First, profilers can present profiling information in dif- 
ferent forms. For example, one profiler may simply 
record events that occur in the virtual machine in a trace 
file. Alternatively, another profiler may receive input 
from the user and display the requested information in- 
teractively. 


Second, the same profiler can work with different virtual 
machine implementations, as long as they all support the 
same profiling interface. This allows tool vendors and 
virtual machine vendors to leverage each other’s prod- 
ucts effectively. 


A profiling interface, while providing flexibility, also has 
potential shortcomings. On one hand, profiler front-ends 
may be interested in a diverse set of events that occur in 
the virtual machine. On the other hand, virtual machine 
implementations from different vendors may be different
enough that it is impossible to expose all the
interesting events through a general-purpose interface.


The contribution of our work is to reconcile these differ- 
ences. We have designed a general-purpose Java Virtual 
Machine Profiler Interface (JVMPI) that is efficient and
powerful enough to suit the needs of a wide variety of 
virtual machine implementations and profiler front-ends. 


Figure 1 illustrates the overall architecture. The JVMPI 
is a binary function-call interface between the Java vir- 
tual machine and a profiler agent that runs in the same 
process. The profiler agent is responsible for the com- 
munication between the Java virtual machine and the 
profiler front-end. Note that although the profiler agent 
runs in the same process as the virtual machine, the pro- 
filer front-end typically resides in a different process, or 
even on a different machine. The reason for the sepa- 
ration of the profiler front-end is to prevent the profiler 
front-end from interfering with the application. Process-level
separation ensures that resources consumed by the
profiler front-end do not get attributed to the profiled
application. Our experience shows that it is possible 
to write profiler agents that delegate resource-intensive 
tasks to the profiler front-end, so that running the pro- 
filer agent in the same process as the virtual machine 
does not overly distort the profiling information. 


We will introduce some of the features of the Java vir- 
tual machine profiling interface in the remainder of this 
section, and discuss how such features are supported by 
the Java virtual machine in later sections. 


2.1 Java Virtual Machine Profiler Interface 


Figure 1 illustrates the role of the JVMPI in the overall 
profiler architecture. The JVMPI is a two-way function 
call interface between the Java virtual machine and the 
profiler agent. 


The profiler agent is typically implemented as a 
dynamically-loaded library. The virtual machine makes 
function calls to inform the profiler agent about various 
events that occur during the execution of the Java appli- 
cation. The agent in turn receives profiling events, and 
calls back into the Java virtual machine to accomplish 
one of the following tasks:


• The agent may disable and enable certain types of
events sent through the JVMPI, based on the needs
of the profiler front-end.


• The agent may request more information in
response to particular events. For example, after the
agent receives a JVMPI event, it can make a JVMPI
function call to find out the stack trace for the
current thread, so that the profiler front-end can inform
the user about the program execution context that
led to this JVMPI event.


Using function calls is a good approach to an efficient 
binary interface between the profiler agent and differ- 
ent virtual machine implementations. Sending profiling 
events through function calls is somewhat slower than 
directly instrumenting the virtual machine to gather
specific profiling information. As we will see, however, the
majority of the profiling events are sent in situations where
we can tolerate the additional cost of a function call. 


JVMPI events are data structures consisting of an inte- 
ger indicating the type of the event, the identifier of the 
thread in which the event occurred, followed by infor- 
mation specific to the event. To illustrate, we list the 
definition of the JVMPI_Event structure and one of its 
variants gc_info below. The gc_info variant records 
information about an invocation of the garbage collec- 
tor. The event-specific information indicates the number 
of live objects, total space used by live objects, and the 
total heap size. 


typedef struct {
    jint event_type;
    JNIEnv *thread_id;

    union {
        struct {
            jlong used_objects;
            jlong used_object_space;
            jlong total_object_space;
        } gc_info;
    } u;
} JVMPI_Event;


Additional details of the JVMPI can be found in the doc- 
umentation that is shipped with the JDK 1.2 release [15]. 
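

As an illustration of how an agent might consume the gc_info variant, the C sketch below derives two summary figures from one event. It is self-contained rather than built against the JDK headers: the `jint`, `jlong`, and `JNIEnv` typedefs are local stand-ins, `SketchEvent` merely mirrors the structure listed above, and `heap_occupancy` and `average_live_object_size` are hypothetical helper names.

```c
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for the JNI types, so the sketch compiles on its own. */
typedef int32_t jint;
typedef int64_t jlong;
typedef struct JNIEnv_ JNIEnv;

/* Mirror of the event structure shown above, gc_info variant only. */
typedef struct {
    jint event_type;
    JNIEnv *thread_id;
    union {
        struct {
            jlong used_objects;        /* number of live objects */
            jlong used_object_space;   /* space used by live objects */
            jlong total_object_space;  /* total heap size */
        } gc_info;
    } u;
} SketchEvent;

/* Fraction of the heap occupied by live objects after a collection. */
static double heap_occupancy(const SketchEvent *e) {
    if (e->u.gc_info.total_object_space == 0) {
        return 0.0;
    }
    return (double)e->u.gc_info.used_object_space /
           (double)e->u.gc_info.total_object_space;
}

/* Average size of a live object, or 0 when nothing is live. */
static double average_live_object_size(const SketchEvent *e) {
    if (e->u.gc_info.used_objects == 0) {
        return 0.0;
    }
    return (double)e->u.gc_info.used_object_space /
           (double)e->u.gc_info.used_objects;
}
```

An agent could compute these values each time it receives a garbage collection event and forward only the two numbers to the front-end, keeping the in-process work small.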


2.2 The HPROF Agent 


To illustrate the power of the JVMPI and show how it
may be utilized, we describe some of the features in the
HPROF agent, a simple profiler agent shipped with JDK
1.2. The HPROF agent is a dynamically-linked library.
It interacts with the JVMPI and
presents profiling information either to the user directly
or through profiler front-ends.


Figure 1: Profiler Architecture


We can invoke the HPROF agent by passing a special 
option to the Java virtual machine: 


java -Xrunhprof ProgName 


ProgName is the name of a Java application. Note that 
we pass the -Xrunhprof option to java, the opti- 
mized version of the Java virtual machine. We need not 
rely on a specially instrumented version of the virtual 
machine to support profiling. 


Depending on the type of profiling requested, HPROF 
instructs the virtual machine to send it the relevant pro- 
filing events. It gathers the event data into profiling in- 
formation and outputs the result by default to a file. For 
example, the following command obtains the heap allo- 
cation profile for running a program: 


java -Xrunhprof:heap=sites ProgName 


Figure 2 contains the heap allocation profile generated
by running the Java compiler (javac) on a set of input
files. We only show parts of the profiler output here due
to the lack of space. A crucial piece of information in
a heap profile is the amount of allocation that occurs in
various parts of the program. The SITES record in Figure 2
tells us that 9.18% of the live bytes belong to character arrays.
Note that the amount of live data is only a fraction of the
total allocation that has occurred at a given site; the rest
has been garbage collected.


A good way to relate allocation sites to the source code
is to record the dynamic stack traces that led to the heap
allocation. Figure 3 shows another part of the profiler
output that illustrates the stack traces referred to by the
four allocation sites presented in Figure 2.


Each frame in the stack trace contains class name, 
method name, source file name, and the line number. 
The user can set the maximum number of frames col- 
lected by the HPROF agent. The default limit is 4. Stack 
traces reveal not only which methods performed heap 
allocation, but also which methods were ultimately
responsible for making calls that resulted in memory
allocation. For example, in the heap profile in Figure 2, instances
of the same java/util/Hashtable$Entry class
are allocated in traces 1091 and 1264, each originating
from a different method.


The HPROF agent has built-in support for profiling CPU 
usage. For example, Figure 4 is part of the generated 
output after the HPROF agent performs sampling-based 
CPU time profiling on the javac compiler. 


The HPROF agent periodically samples the stack of 
all running threads to record the most frequently ac- 
tive stack traces. The count field above indicates how 
many times a particular stack trace was found to be ac- 
tive. These stack traces correspond to the CPU usage hot 
spots in the application. 
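

The self and accum columns of the CPU samples output in Figure 4 follow directly from the count field and the total number of samples. The C sketch below shows the arithmetic; the `hot_spot_t` type and the `rank_percentages` function are hypothetical names for this illustration and are not taken from the HPROF sources.

```c
#include <stddef.h>

/* One row of a sampling profile, already sorted by descending count. */
typedef struct {
    long count;        /* times this stack trace was found active */
    double self_pct;   /* share of all samples, in percent */
    double accum_pct;  /* running total of self_pct down the ranking */
} hot_spot_t;

/* Fill in the percentage columns for n rows given the total sample count. */
static void rank_percentages(hot_spot_t *rows, size_t n, long total) {
    double accum = 0.0;
    for (size_t i = 0; i < n; i++) {
        rows[i].self_pct = 100.0 * (double)rows[i].count / (double)total;
        accum += rows[i].self_pct;
        rows[i].accum_pct = accum;
    }
}
```

For the two top entries of Figure 4 (12514 and 8022 samples out of 252378) this yields about 4.96% and 3.18%, with an accumulated share of about 8.14%, matching the reported values.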


The HPROF agent can also report complete heap dumps 
and monitor contention information. Due to the lack of 
space, we will not list more examples of how the HPROF 
agent presents the information obtained through the pro- 
filing interface. However, we are ready to explain the de- 
tails of how various profiling interface features are sup- 
ported in the virtual machine. 


SITES BEGIN (ordered by live bytes) Wed Oct 7 11:38:10 1998
         percent         live         alloc'ed      stack class
 rank   self  accum    bytes  objs    bytes   objs  trace name
    1  9.18%  9.18%   149224  5916  1984600 129884   1073 char []
    2  7.28% 16.45%   118320  5916   118320   5916   1090 sun/tools/java/Identifier
    3  7.28% 23.73%   118320  5916   118320   5916   1091 java/util/Hashtable$Entry

    7  3.39% 41.42%    55180  2759    55180   2759   1264 java/util/Hashtable$Entry
SITES END


Figure 2: HPROF Heap Allocation Profile 





THREAD START (obj=1d6b20, id = 1, name="main", group="main")

TRACE 1073: (thread=1)
    java/lang/String.<init>(String.java:244)
    sun/tools/java/Scanner.bufferString(Scanner.java:143)
    sun/tools/java/Scanner.scanIdentifier(Scanner.java:942)
    sun/tools/java/Scanner.xscan(Scanner.java:1281)

TRACE 1090: (thread=1)
    sun/tools/java/Identifier.lookup(Identifier.java:106)
    sun/tools/java/Scanner.scanIdentifier(Scanner.java:942)
    sun/tools/java/Scanner.xscan(Scanner.java:1281)
    sun/tools/java/Scanner.scan(Scanner.java:971)

TRACE 1091: (thread=1)
    java/util/Hashtable.put(Hashtable.java:405)
    sun/tools/java/Identifier.lookup(Identifier.java:106)
    sun/tools/java/Scanner.scanIdentifier(Scanner.java:942)
    sun/tools/java/Scanner.xscan(Scanner.java:1281)

TRACE 1264: (thread=1)
    java/util/Hashtable.put(Hashtable.java:405)
    sun/tools/java/Type.<init>(Type.java:90)
    sun/tools/java/MethodType.<init>(MethodType.java:42)
    sun/tools/java/Type.tMethod(Type.java:274)

Figure 3: HPROF Stack Traces 





CPU SAMPLES BEGIN (total = 252378) Wed Oct 07 13:30:10 1998
rank   self  accum   count trace method
   1  4.96%  4.96%   12514   303 sun/io/ByteToCharSingleByte.convert
   2  3.18%  8.14%    8022   306 java/lang/String.charAt
   3  1.91% 10.05%    4828   301 sun/tools/java/ScannerInputReader.<init>
   4  1.80% 11.85%    4545   305 sun/io/ByteToCharSingleByte.getUnicode
   5  1.50% 13.35%    3783   304 sun/io/ByteToCharSingleByte.getUnicode
   6  1.30% 14.65%    3280   336 sun/tools/java/ScannerInputReader.read
   7  1.13% 15.78%    2864   404 sun/io/ByteToCharSingleByte.convert
   8  1.11% 16.89%    2800   307 java/lang/String.length
   9  1.00% 17.89%    2516  4028 java/lang/Integer.toString
  10  0.95% 18.84%    2403   162 java/lang/System.arraycopy
CPU SAMPLES END


Figure 4: HPROF Profile of CPU Usage Hot Spots 


3 CPU Time Profiling


A CPU time profiler collects data about how much CPU 
time is spent in different parts of the program. Equipped 
with this information, programmers can find ways to re- 
duce the total execution time. 


3.1 Design Choices 


We considered the following design choices when build- 
ing the support for CPU time profilers: the granularity of 
profiling information and whether to use statistical sam- 
pling or code instrumentation. 


3.1.1 Granularity 


Shall we present information at the method call level, or 
at a finer granularity such as basic blocks or different ex- 
ecution paths inside a method? Based on our experience 
with tuning Java applications, we believe that there is 
little reason to attribute cost to a finer granularity than 
methods. Programmers typically have a good under- 
standing of cost distribution inside a method; methods 
in Java applications tend to be smaller than, for exam- 
ple, C/C++ functions. 


It is not enough to report a flat profile consisting only 
of the portion of time in individual methods. If, for 
example, the profiler reports that a program spends a 
significant portion of time in the String.getBytes
method, how do we know which part of our program in- 
directly contributed to invoking this method, if the pro- 
gram does not call this method directly? 


A good way to attribute profiling information to Java 
applications is to report the dynamic stack traces that 
lead to the resource consumption. Dynamic stack traces 
become less informative in some languages where it is 
hard to associate stack frames with source language con- 
structs, such as when anonymous functions are involved. 
Fortunately, anonymous inner classes in the Java pro- 
gramming language are represented by classes with in- 
formative names at run time. 


3.1.2 Statistical Sampling vs. Code Instrumentation 


There are two ways to obtain profiling information:
either statistical sampling or code instrumentation.
Statistical sampling is less disruptive to program
execution, but cannot provide completely accurate
information. Code instrumentation, on the other hand, may be
more disruptive, but allows the profiler to record all the
events it is interested in. Specifically in CPU time
profiling, statistical sampling may reveal, for example, the
relative percentage of time spent in frequently-called
methods, whereas code instrumentation can report the exact
number of times each method is invoked.


5th USENIX Conference on Object-Oriented Technologies and Systems (COOTS '99)


Our framework supports both statistical sampling and 
code instrumentation. Through the JVMPI, the pro- 
filer agent can periodically sample the stack of all run- 
ning threads, thus discovering the most frequently active 
stack traces. Alternatively, the profiler agent may ask the 
virtual machine to send events on entering and exiting 
methods. Naturally the latter approach introduces addi- 
tional C function call overhead to each profiled method. 


A less disruptive way to implement code instrumenta- 
tion is to inject profiling code directly into the profiled 
program. This type of code instrumentation is easier 
on the Java platform than on traditional CPUs, because 
there is a standard class file format. The JVMPI allows 
the profiler agent to instrument every class file before 
it is loaded by the virtual machine. The profiler agent
may, for example, insert a custom byte code sequence that
records method invocations, control flow among the
basic blocks, or other operations (such as object creation or
monitor operations) performed inside the method body.
When the profiler agent changes the content of a class 
file, it must ensure that the resulting class file is still valid 
according to the Java virtual machine specification. 


3.2 Thread-Aware CPU Time Sampling 


The Java virtual machine is a multi-threaded execution 
environment. One difficulty in building CPU time pro- 
filers for such systems is how to properly attribute CPU 
time to each thread, so that the time spent in a method 
is accounted only when the method actually runs on the 
CPU, not when it is unscheduled and waiting to run. The 
basic CPU time sampling algorithm is as follows: 


while (true) {
    - sleep for a short interval;
    - suspend all threads;
    - record the stack traces of all threads
      that have run in the last interval;
    - attribute a cost unit to these stack
      traces;
    - resume all threads;
}


The profiler needs to suspend the thread while collecting 
its stack trace, otherwise a running thread may change 
the stack frames as the stack trace is being collected. 
The main difficulty in the above scheme is how to
determine whether a thread has run in the last sampling
interval. We should not attribute cost units to threads that were
waiting for an I/O operation, or waiting to be scheduled,
in the last sampling interval. Ideally, this problem would
be solved if the scheduler could inform the profiler of the
exact time interval in which a thread is running, or if the
profiler could find out the amount of CPU time a thread
has consumed at each sampling point.


Unfortunately, modern operating systems such as Win- 
dows NT and Solaris neither expose the kernel sched- 
uler nor provide ways to obtain accurate per-thread CPU 
time. For example, the GetThreadTimes call on 
Windows NT returns per-thread CPU time in 10 mil- 
lisecond increments, too inaccurate for profiling needs. 


Our solution is to determine whether a thread has run in 
a sampling interval by checking whether its register set 
has changed. If a thread has run in the last sampling in- 
terval, it is almost certain that the contents of the register 
set have changed. 


The information gathered for the purpose of profiling 
need not be 100% reliable. It is extremely unlikely, 
however, that a running thread maintains an unchanged 
register set, which includes such registers as the stack 
pointer, the program counter, and all general-purpose 
registers. One pathological example of a running pro- 
gram with a constant register set is the following C code 
segment, where the program enters into an infinite loop 
that consists of one instruction: 


loop: goto loop; 


In practice, we find that it suffices to compute and record 
a checksum of a subset of the registers, thus further re- 
ducing the overhead of the profiler. 
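

The register-set check can be sketched as follows. This is only an illustration in Java, not the profiler's actual code: the class `RegisterChecksum`, its methods, and the use of CRC32 over three register values are assumptions made for the example. A real agent would read the registers of a suspended native thread, which plain Java code cannot do.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.zip.CRC32;

// Hypothetical sketch: decides whether a thread ran in the last sampling
// interval by comparing a checksum of a subset of its registers.
class RegisterChecksum {
    // Last checksum recorded for each thread ID.
    private final Map<Long, Long> lastChecksum = new HashMap<>();

    // Checksum over the sampled register values (for example PC, SP, one GPR).
    static long checksum(long[] registers) {
        CRC32 crc = new CRC32();
        for (long r : registers) {
            for (int shift = 0; shift < 64; shift += 8) {
                crc.update((int) (r >>> shift) & 0xFF); // feed one byte at a time
            }
        }
        return crc.getValue();
    }

    // Returns true if the thread appears to have run since the previous sample.
    // The first sample of a thread is treated as "has run".
    boolean hasRun(long threadId, long[] registers) {
        long current = checksum(registers);
        Long previous = lastChecksum.put(threadId, current);
        return previous == null || previous != current;
    }
}
```

Only one checksum per thread is stored between samples, so the per-sample cost stays small even when many threads exist.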


The cost of suspending all threads and collecting their 
stack traces is roughly proportional to the number of
threads running in the virtual machine. A minor en- 
hancement to the sampling algorithm discussed earlier 
is that we need not suspend and collect stack traces for 
threads that are blocked on monitors managed by the vir- 
tual machine. This significantly reduces the profiling 
overhead for many multi-threaded programs in which 
most threads are blocked most of the time. Our expe- 
rience shows that, for typical programs, the total over- 
head of our sampling-based CPU time profiler with a 
sampling interval of 1 millisecond is less than 20%. 


4 Heap Profiling 


Heap profiling serves a number of purposes: pinpointing
the part of the program that performs excessive heap
allocation, revealing the performance characteristics of the
underlying garbage collection algorithm, and detecting
the causes of unnecessary object retention.


4.1 Excessive Heap Allocation 


Excessive heap allocation degrades performance for two
reasons: the cost of the allocation operations themselves
and, because the heap fills up more quickly, the cost of
more frequent garbage collections. With the JVMPI, the
profiler takes the following steps to pinpoint the part of
the program that performs excessive heap allocation:


• Enable the event notification for object allocation,
  so that the virtual machine issues a function call to
  the profiler agent when the current thread performs
  heap allocation.


• Obtain the current stack trace from the virtual
  machine when an object allocation event arrives. The
  stack trace serves as a good identification of the
  heap allocation site. The programmer should
  concentrate on optimizing busy heap allocation sites.


• Enable the event notification for object reclamation,
  so that the profiler can keep track of how many
  objects allocated from a given site are being kept live.
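

The bookkeeping implied by these steps might look like the sketch below. It is an assumption-laden illustration, not the JVMPI API: the class `AllocationSiteTracker` and its method names are hypothetical, and a string stands in for the stack trace that identifies an allocation site.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the per-site counters a profiler agent might keep.
class AllocationSiteTracker {
    private final Map<String, Integer> allocated = new HashMap<>(); // site -> total allocations
    private final Map<String, Integer> live = new HashMap<>();      // site -> objects still live
    private final Map<Long, String> siteOfObject = new HashMap<>(); // object ID -> site

    // Called on an object allocation event; 'site' stands for the stack trace.
    void onAllocation(long objectId, String site) {
        allocated.merge(site, 1, Integer::sum);
        live.merge(site, 1, Integer::sum);
        siteOfObject.put(objectId, site);
    }

    // Called on an object reclamation event.
    void onReclamation(long objectId) {
        String site = siteOfObject.remove(objectId);
        if (site != null) {
            live.merge(site, -1, Integer::sum);
        }
    }

    int allocatedAt(String site) { return allocated.getOrDefault(site, 0); }
    int liveAt(String site) { return live.getOrDefault(site, 0); }
}
```

Sites with a high `allocatedAt` count are the busy allocation sites; sites whose `liveAt` count keeps growing are candidates for retention problems.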


4.2 Algorithm-Independent Allocation and 
Garbage Collection Events 


Many memory allocation and garbage collection algo- 
rithms are suitable for different Java virtual machine 
implementations. Mark-and-sweep, copying, genera- 
tional, and reference counting are some examples. This 
presents a challenge to designing a comprehensive pro- 
filing interface: Is there a set of events that can uniformly 
handle a wide variety of garbage collection algorithms? 


We have designed a set of profiling events that covers 
all garbage collection algorithms we are currently con- 
cerned with. We introduce the abstract notion of an 
arena, in which objects are allocated. The virtual ma- 
chine issues the following set of events: 


• NEW_ARENA(arena ID)

• DELETE_ARENA(arena ID)

• NEW_OBJECT(arena ID, object ID, class ID)

• DELETE_OBJECT(object ID)

• MOVE_OBJECT(old arena ID, old object ID, new
  arena ID, new object ID)


Our notation encodes the event-specific information in 
a pair of parentheses, immediately following the event 
type. Let us go through some examples to see how these 
events may be used with different garbage collection al- 
gorithms: 


• A mark-and-sweep collector issues NEW_OBJECT
  events when allocating objects, and issues
  DELETE_OBJECT events when adding objects to
  the free list. Only one arena ID is needed.


• A mark-sweep-compact collector additionally
  issues MOVE_OBJECT events. Again, only one
  arena is needed; the old and new arena IDs in the
  MOVE_OBJECT events are the same.


• A standard two-space copying collector creates two
  arenas. It issues MOVE_OBJECT events during
  garbage collection, and a DELETE_ARENA event
  followed by a NEW_ARENA event with the same
  arena ID to free up all remaining objects in the
  semi-space.


• A generational collector issues a NEW_ARENA
  event for each generation. When an object
  is scavenged from one generation to another, a
  MOVE_OBJECT event is issued. All objects in an
  arena are implicitly freed when a DELETE_ARENA
  event arrives.


• A reference-counting collector issues
  NEW_OBJECT events when new objects are
  created, and issues DELETE_OBJECT events when
  the reference count of an object reaches zero.


In summary, this simple set of heap allocation events
supports a wide variety of garbage collection algorithms.
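

To make the event semantics concrete, here is a minimal sketch of a profiler-side heap model driven by the five events. The class `ArenaHeapModel` and its method names are hypothetical and chosen for this example; only the event names and their arguments come from the text above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical profiler-side model driven by the five arena events.
class ArenaHeapModel {
    private final Map<Long, Integer> arenaOf = new HashMap<>(); // object ID -> arena ID
    private final Map<Long, Integer> classOf = new HashMap<>(); // object ID -> class ID

    // NEW_ARENA: nothing to record until objects arrive in the arena.
    void newArena(int arenaId) { }

    // DELETE_ARENA: all objects still in the arena are implicitly freed.
    void deleteArena(int arenaId) {
        arenaOf.entrySet().removeIf(e -> {
            if (e.getValue() == arenaId) {
                classOf.remove(e.getKey());
                return true;
            }
            return false;
        });
    }

    // NEW_OBJECT
    void newObject(int arenaId, long objectId, int classId) {
        arenaOf.put(objectId, arenaId);
        classOf.put(objectId, classId);
    }

    // DELETE_OBJECT
    void deleteObject(long objectId) {
        arenaOf.remove(objectId);
        classOf.remove(objectId);
    }

    // MOVE_OBJECT: the object keeps its class but gets a new ID and arena.
    void moveObject(int oldArenaId, long oldObjectId, int newArenaId, long newObjectId) {
        Integer classId = classOf.remove(oldObjectId);
        arenaOf.remove(oldObjectId);
        if (classId != null) {
            newObject(newArenaId, newObjectId, classId);
        }
    }

    boolean isLive(long objectId) { return arenaOf.containsKey(objectId); }
    int liveCount() { return arenaOf.size(); }
}
```

For a two-space copying collector, survivors are reported with `moveObject` and the rest of the semi-space disappears with a single `deleteArena` call, so the model never needs to know which algorithm produced the events.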


4.3 Unnecessary Object Retention 


Unnecessary object retention occurs when an object is
no longer useful but is kept alive by another object
that is in use. For example, a programmer may insert
objects into a global hash table. These objects cannot be
garbage collected as long as any entry in the hash table
is useful and the hash table is kept alive.


5th USENIX Conference on Object-Oriented Technologies and Systems (COOTS '99)


An effective way to find the causes of unnecessary ob- 
ject retention is to analyze the heap dump. The heap 
dump contains information about all the garbage collec- 
tion roots, all live objects, and how objects refer to each 
other. 


Our profiling interface allows the profiler agent to re- 
quest the entire heap dump, which can in turn be sent to 
the profiler front-end for further processing and analysis. 
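

As an illustration of the kind of analysis a front-end could run on a heap dump, the sketch below searches for a chain of references from a garbage collection root to a given object. The representation of the dump (a map from object ID to referenced IDs) and the class `RetentionPath` are assumptions for this example, not the dump format of the profiling interface.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: breadth-first search over a heap dump graph.
class RetentionPath {
    // refs maps an object ID to the IDs it refers to.
    // Returns a shortest root-to-target chain, or an empty list if unreachable.
    static List<Long> find(Map<Long, List<Long>> refs, Collection<Long> roots, long target) {
        Map<Long, Long> parent = new HashMap<>(); // visited node -> node that reached it
        Deque<Long> queue = new ArrayDeque<>();
        for (Long root : roots) {
            if (!parent.containsKey(root)) {
                parent.put(root, null); // roots have no parent
                queue.add(root);
            }
        }
        while (!queue.isEmpty()) {
            Long current = queue.remove();
            if (current == target) {
                LinkedList<Long> path = new LinkedList<>();
                for (Long n = current; n != null; n = parent.get(n)) {
                    path.addFirst(n);
                }
                return path;
            }
            List<Long> children = refs.getOrDefault(current, Collections.<Long>emptyList());
            for (Long next : children) {
                if (!parent.containsKey(next)) {
                    parent.put(next, current);
                    queue.add(next);
                }
            }
        }
        return Collections.emptyList();
    }
}
```

The returned chain names the objects that keep the target alive, which is usually the first thing a programmer wants to know about an unexpectedly retained object.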


An alternative way to track unnecessary object retention 
is to provide direct support in the profiling interface
for finding all objects that refer to a given object. The 
advantage of this incremental approach is that it requires 
less temporary storage than complete heap dumps. The 
disadvantage is that unlike heap dumps, the incremental 
approach cannot present a consistent view of all heap ob- 
jects that are constantly being modified during program 
execution. 


In practice, we do not find the size of heap dumps to be 
a problem. Typically the majority of the heap space is 
occupied by primitive arrays. Because there are no in- 
ternal pointers in primitive arrays, elements of primitive 
arrays need not be part of the heap dump. 


5 Monitor Profiling 


Monitors are the fundamental synchronization mecha- 
nism in the Java programming language. Programmers 
are generally concerned with two issues related to mon- 
itors: the performance impact of monitor contention and 
the cause of deadlocks. With the recent advances in 
monitor implementation [4] [21], non-contended moni- 
tor operations are no longer a performance issue. A non- 
contended monitor enter operation, for example, takes 
only 4 machine instructions on the x86 CPUs [21]. In 
properly tuned programs, the vast majority of monitor
operations are non-contended. For example, Table 1 shows
the ratio of contended monitor operations in a num- 
ber of programs. The first 8 applications are from the 
SPECjvm98 benchmark. The last two applications are 
GUI-rich programs. The monitor contention rate is ex- 
tremely low in all programs. In fact, all but one program 
(mtrt) in the SPECjvm98 benchmark suite are single- 
threaded. 


Table 1: Monitor Contention Rate of Benchmark Programs


5.1 Monitor Contention 


Monitor contention is the primary cause of lack of scal- 
ability in multi-processor systems. Monitor contention 
is typically caused by multiple threads holding global 
locks too frequently or too long. To detect these scenar- 
ios, the profiler may enable the following three types of 
event notifications: 


• A thread waiting to enter a monitor that
  is already owned by another thread issues a
  MONITOR_CONTENDED_ENTER event. This event
  indicates possible performance bottlenecks caused
  by frequently-contended monitors.


• After a thread finishes waiting to enter a
  monitor and acquires the monitor, it issues a
  MONITOR_CONTENDED_ENTERED event. This
  event indicates the amount of elapsed time the
  current thread has been blocked before it enters the
  monitor.


• When a thread exits a monitor, and discovers
  that another thread is waiting to enter
  the same monitor, the current thread issues a
  MONITOR_CONTENDED_EXIT event. This event
  indicates possible performance problems caused by
  the current thread holding the monitor for too long.


In all three cases, the overhead of issuing the event
is negligible compared to the performance impact of the 
blocked monitor operation. The profiler agent can obtain 
the stack trace of the current thread and thus attribute the 
monitor contention events to the parts of the program 
responsible for issuing the monitor operations. 
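

A profiler agent could turn the first two events into per-monitor statistics along the lines of the sketch below. The class `ContentionStats`, its method names, and the explicit timestamp parameter are assumptions for illustration; they are not part of the JVMPI.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: aggregates contention events per monitor.
class ContentionStats {
    // A thread can be blocked on at most one monitor, so key the start time by thread.
    private final Map<Long, Long> waitStart = new HashMap<>();
    private final Map<Long, Long> totalBlocked = new HashMap<>(); // monitor ID -> blocked nanoseconds
    private final Map<Long, Integer> count = new HashMap<>();     // monitor ID -> contended enters

    // Handler for MONITOR_CONTENDED_ENTER.
    void contendedEnter(long threadId, long monitorId, long nowNanos) {
        waitStart.put(threadId, nowNanos);
        count.merge(monitorId, 1, Integer::sum);
    }

    // Handler for MONITOR_CONTENDED_ENTERED.
    void contendedEntered(long threadId, long monitorId, long nowNanos) {
        Long start = waitStart.remove(threadId);
        if (start != null) {
            totalBlocked.merge(monitorId, nowNanos - start, Long::sum);
        }
    }

    long blockedNanos(long monitorId) { return totalBlocked.getOrDefault(monitorId, 0L); }
    int contentions(long monitorId) { return count.getOrDefault(monitorId, 0); }
}
```

Monitors with a high contention count or a large total blocked time are the ones worth examining first.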


5.2 Deadlocks 


If every thread is waiting to enter a monitor that is
owned by another thread, the system is deadlocked.
A thread/monitor dump is what programmers need to find
the cause of this kind of deadlock.¹
A thread/monitor dump includes the following information:


• The stack traces of all threads.


• The owner of each monitor and the list of threads
  that are waiting to enter the monitor.


To obtain a consistent view of all threads and all 
monitors, we suspend all threads when collecting 
thread/monitor dumps. The JDK has historically pro- 
vided support for thread/monitor dumps triggered by 
special key sequences (such as Ctrl-Break on Win32). 
The JVMPI now allows the profiler agent to obtain the 
same information programmatically. 
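

Given the owner and waiter information in such a dump, a tool can search for a cycle of threads that wait on each other. The sketch below assumes a simplified dump made of two maps; the class `DeadlockFinder` and this representation are hypothetical and only illustrate the idea.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: finds a wait-for cycle in a thread/monitor dump.
class DeadlockFinder {
    // waitingFor: thread -> monitor it is blocked on.
    // owner:      monitor -> thread that currently owns it.
    // Returns the threads forming a cycle, or an empty list if there is none.
    static List<String> findCycle(Map<String, String> waitingFor, Map<String, String> owner) {
        for (String start : waitingFor.keySet()) {
            List<String> chain = new ArrayList<>();
            String thread = start;
            // Follow "blocked on a monitor owned by" links until they end or repeat.
            while (thread != null && !chain.contains(thread)) {
                chain.add(thread);
                String monitor = waitingFor.get(thread);
                thread = (monitor == null) ? null : owner.get(monitor);
            }
            if (thread != null) {
                // 'thread' was seen before: everything from its first occurrence is a cycle.
                return new ArrayList<>(chain.subList(chain.indexOf(thread), chain.size()));
            }
        }
        return Collections.emptyList();
    }
}
```

Because each blocked thread waits on exactly one monitor, following the chain from each thread is enough; no general graph search is required.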


6 Support for Interactive Low-Overhead 
Profiling 


The profiling support we built into the Java virtual ma- 
chine achieves the following two desirable goals: 


• We must be able to support interactive profiler
  front-ends. An approach that only supports collecting
  profiling events into trace files does not meet
  the needs of programmers and tools vendors. The
  user must be able to enable and disable profiling
  events during program execution in order to pinpoint
  performance problems in different stages of running an
  application.


• The profiling support must incur low overhead so
  that programmers can run the application at full
  speed when profiling events are disabled, and only
  pay for the overhead of generating the type of
  events specifically requested by the profiler
  front-end. An approach that requires the use of a less
  optimized virtual machine implementation for
  profiling leads to additional discrepancies between the
  profiled environment and real-world scenarios.


¹Deadlocks may also be caused by implicit locking and ordering in
libraries and system calls, such as I/O operations.


Because of the low overhead of our approach, we are 
able to provide full profiling support in the standard de- 
ployed version of the Java virtual machine implemen- 
tation. It is possible to start an application normally, 
and enable the necessary profiling events later without 
restarting the application. 


6.1 Overhead of Disabled Profiling Events 


The need for dynamically enabling and disabling profil- 
ing events requires added checks in the code paths that 
lead to the generation of these events. 


The majority of profiling events are issued relatively
infrequently. Examples of these types of events are class
loading and unloading, thread start and end, garbage
collection, and JNI global reference creation and deletion.
We can easily support interactive low-overhead profiling
by placing checks in the corresponding code paths without
a performance impact on normal program execution.


Heap profiling events, in particular NEW_OBJECT, 
DELETE_OBJECT, and MOVE_OBJECT introduced in 
Section 4.2, could be quite frequent. An added check 
in every object allocation may have a noticeable per- 
formance impact in program execution, especially if the 
check is inserted in the allocation fast path that typically 
is inlined into the code generated by the Just-In-Time 
(JIT) compilers. Fortunately, garbage-collected memory 
systems by definition need to check for possible heap 
exhaustion conditions in every object allocation, even in 
the fast path. We can thus enable heap allocation events 
by forcing every object allocation into the slow path with 
a false heap exhaustion condition, and check whether 
heap profiling events have been enabled and whether the 
heap is really exhausted in the slow path. Because no
change to the allocation fast path is needed, object
allocation runs at full speed when heap profiling is disabled.
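

The trick can be illustrated with a toy bump-pointer allocator. This is a simulation under stated assumptions, not virtual machine code: the class `BumpAllocator`, its fields, and the use of a counter in place of issuing a NEW_OBJECT event are all hypothetical.

```java
// Hypothetical sketch: a bump-pointer allocator whose fast path is unchanged
// by profiling. Enabling profiling fakes heap exhaustion to force the slow path.
class BumpAllocator {
    private final int capacity;
    private int top = 0;    // next free offset
    private int limit;      // fast path succeeds while top + size <= limit
    private boolean profiling = false;
    int allocationEvents = 0; // stand-in for NEW_OBJECT events issued

    BumpAllocator(int capacity) {
        this.capacity = capacity;
        this.limit = capacity;
    }

    void setProfiling(boolean on) {
        profiling = on;
        limit = on ? 0 : capacity; // a false "heap exhausted" condition
    }

    // Fast path: the usual exhaustion check only, no profiling check.
    int allocate(int size) {
        if (top + size <= limit) {
            int address = top;
            top += size;
            return address;
        }
        return allocateSlow(size);
    }

    // Slow path: check for profiling and for real exhaustion.
    private int allocateSlow(int size) {
        if (profiling) {
            allocationEvents++;
        }
        if (top + size > capacity) {
            // A real virtual machine would trigger garbage collection here.
            throw new IllegalStateException("simulated heap exhausted");
        }
        int address = top;
        top += size;
        return address;
    }
}
```

With profiling disabled, `allocate` executes exactly the comparison it would need anyway, which is why the disabled case costs nothing extra.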


Method enter and exit events are another kind of event
that may be generated frequently. They can be easily 
supported by the JIT compilers that can dynamically 
patch the generated code and the virtual method dispatch 
tables. 


6.2 The Partial Profiling Problem 


A problem that arises when profiler events can be en- 
abled and disabled is that the profiler agent receives 
incomplete, or partial, profiling information. This has 
been characterized as the partial profiling problem [16]. 
For example, if the profiler agent enables the thread start 
and end events after a thread has been started, it will 
receive an unknown thread ID that has not been de- 
fined in any thread start event. Similarly, if the pro- 
filer agent enables the class load event after a number 
of classes have been loaded and a number of instances 
of these classes have been created, the agent may en- 
counter NEW_OBJECT events that contain an unknown 
class ID. 


A straightforward solution is to require the virtual ma- 
chine to record all profiling events in a trace file, whether 
or not these events are enabled by the profiler agent. The 
virtual machine is then able to send the appropriate in- 
formation for any entities unknown to the profiler agent. 
This approach is undesirable because of the potentially 
unlimited size of the trace file and the overhead when 
profiling events are disabled. 


We solve the partial profiling problem based on one ob- 
servation: The Java virtual machine keeps track of infor- 
mation internally about the valid entities (such as class 
IDs) that can be sent with profiling events. The virtual 
machine need not keep track of outdated entities (such as 
a class that has been loaded and unloaded) because they 
will not appear in profiling events. When the profiler
agent receives an unknown entity (such as an unknown 
class ID), the entity is still valid, and thus the agent can 
immediately obtain all the relevant information from the 
virtual machine. We introduce a JVMPI function that 
allows the profiling agent to request information about 
unknown entities received as part of a profiling event. 
For example, when the profiler agent encounters an un- 
known class ID, it may request the virtual machine to 
send the same information that is contained in a class 
load event for this class. 
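

On the agent side, this amounts to a lookup table that falls back to a request when it meets an unknown ID. The sketch below is a simplified model: the class `ClassTable` is hypothetical, and the `LongFunction` passed to its constructor merely stands in for the JVMPI request function, whose real name and signature are not shown here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongFunction;

// Hypothetical sketch: agent-side class table that tolerates partial profiling.
class ClassTable {
    private final Map<Long, String> names = new HashMap<>();
    private final LongFunction<String> requestFromVm; // stand-in for the request function
    int requests = 0; // number of times the virtual machine had to be asked

    ClassTable(LongFunction<String> requestFromVm) {
        this.requestFromVm = requestFromVm;
    }

    // Normal path: class load and unload events keep the table current.
    void onClassLoad(long classId, String name) { names.put(classId, name); }
    void onClassUnload(long classId) { names.remove(classId); }

    // Called when an event (such as NEW_OBJECT) mentions a class ID.
    // An unknown ID is still valid in the virtual machine, so ask for it once.
    String nameOf(long classId) {
        String name = names.get(classId);
        if (name == null) {
            requests++;
            name = requestFromVm.apply(classId);
            names.put(classId, name);
        }
        return name;
    }
}
```

The virtual machine is queried only for entities the agent actually encounters, so no trace of earlier events has to be kept.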


Certain entities need to be treated specially by the
profiling agent in order to deal with partial profiling
information. For example, if the profiling agent disables the
MOVE_OBJECT event, it must immediately invalidate all 
object IDs it knows about, because they may be changed 
by future garbage collections. With the MOVE_OBJECT 
event disabled, the agent can request the virtual ma- 
chine to send the class information about unknown ob- 
ject IDs. However, such requests must be made only 
when garbage collection is disabled (by, for example, 
calling one of the JVMPI functions). Otherwise garbage 
collection may generate a MOVE_OBJECT event asyn- 
chronously and invalidate the object ID before the virtual 
machine obtains the class information for this object ID. 


7 Related Work 


Extensive work has been done in CPU time profiling. 
The gprof tool [11], for example, is a sampling-based
profiler that records call graphs, instead of flat profiles. 
Recent research [7] [18] [19] has improved the perfor- 
mance and accuracy of time profilers based on code in- 
strumentation. Analysis techniques have been devel- 
oped such that instrumentation code may be inserted 
with as little run-time overhead as possible [5] [1]. Our 
sampling-based CPU time profiling uses stack traces to 
report CPU usage hot-spots, and is the most similar to 
the technique of call graph profiling [12]. Sansom et al.
[20] investigated how to properly attribute costs in
profiling higher-order lazy functional programs. Appel et
al. [2] studied how to efficiently instrument code in the
presence of code inlining and garbage collection. None 
of the above work addresses the issues in profiling multi- 
threaded programs, however. 


Issues similar to profiling multi-threaded programs arise 
in parallel programs [3] [13], where the profiler typically 
executes concurrently with the program, and can selec- 
tively profile parts of the program. 


Heap profiling similar to that reported in this paper has 
been developed for C, Lisp [22], and Modula-3 [9]. To 
our knowledge, our work is the first that constructs a 
heap profiling interface that is independent of the under- 
lying garbage collection algorithm. 


We have a general-purpose profiling architecture, but 
sometimes it is also useful to build custom profilers [8] 
that target specific compiler optimizations. 


There have been numerous earlier experiments (for ex- 
ample, [6]) on building interactive profiling tools for 
Java applications. These approaches are typically based 
on placing custom instrumentation in the Java virtual
machine implementation.


8 Conclusions 


We have presented a profiling architecture that pro- 
vides comprehensive profiling support in the Java vir- 
tual machine. The scope of profiling information in- 
cludes multi-threaded CPU usage hot spots, heap allo- 
cation, garbage collection, and monitor contention. Our 
framework supports interactive profiling, and carries ex- 
tremely low run-time overhead. 


We believe that our work lays a foundation for building 
advanced profiling tools. 